Intel-Wired-Lan Archive on lore.kernel.org
From: Lifshits, Vitaly <vitaly.lifshits@intel.com>
To: intel-wired-lan@osuosl.org
Subject: [Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
Date: Wed, 4 Sep 2019 15:31:04 +0300
Message-ID: <2ea7240f-f0f3-8b43-bf97-6c2a1a3f2a66@intel.com>
In-Reply-To: <53d81b8c69ddeba6f76128f308ff5275@mirality.co.nz>

On 9/4/2019 14:08, Gavin Lambert wrote:
> On 2019-09-04 22:06, Winkler, Tomas wrote:
>>>
>>> On 2019-09-03 21:39, Paul Menzel wrote:
>>> > Dear Tomas,
>>> >
>>> > On 2019-09-03 11:28, Winkler, Tomas wrote:
>>> >
>>> >>> On Tue, Sep 03, 2019 at 10:35:30AM +0200, Paul Menzel wrote:
>>> >
>>> >>>> On 03.09.19 09:56, Gavin Lambert wrote:
>>> >>>>> On 2019-08-20 14:15, I wrote:
>>> >>>>>> Does anyone have any ideas about this? Either towards further
>>> >>>>>> investigation or to a possible resolution?
>>> >>>>>>
>>> >>>>>> This is at the point of hardware internals now, so I have no idea
>>> >>>>>> how to proceed in either area.
>>> >>>>>
>>> >>>>> To recap (plus some new info):
>>> >>>>>
>>> >>>>> 1. I am using a kernel module which uses the code from the e1000e
>>> >>>>> driver to communicate with the hardware without actually
>>> >>>>> registering it as a Linux netdev. (This is partly because it can
>>> >>>>> get used in a Xenomai context outside of Linux itself, although
>>> >>>>> I'm not doing that myself.) This historically works fine.
>>> >>>>>
>>> >>>>> 2. On certain Linux versions, I encountered an issue where
>>> >>>>> disconnecting the network cable and reconnecting it almost always
>>> >>>>> results in not being able to send any packets. (I cannot
>>> >>>>> determine if receiving packets works in this case, as the network
>>> >>>>> design will not receive packets unless some are sent first.)
>>> >>>>> Restarting the driver (rmmod+modprobe) does recover from this case
>>> >>>>> (until the next link loss), but simply replugging the cable never
>>> >>>>> does.
>>> >>>>>
>>> >>>>> 3. The problem was observed with both I219-V and I219-LM (on
>>> >>>>> motherboard), but was *not* observed with 82571EB (PCIE). The
>>> >>>>> problem was not observed with a motherboard igb-based I211. I
>>> >>>>> suspect the issue is limited to motherboard-based e1000e adapters.
>>> >>>>> (Or perhaps there's something different about how the IGBs are
>>> >>>>> internally connected.)
>>> >>>>>
>>> >>>>> 4. The problem does not occur when the e1000e driver is registered
>>> >>>>> "normally" as a Linux netdev.
>>> >>>>>
>>> >>>>> 5. The problem was introduced by "mei: me: allow runtime pm for
>>> >>>>> platform with D0i3" (which has been backported to 4.4+, as far as
>>> >>>>> I can tell).
>>> >>>>> Excluding this commit reliably resolves the issue and including it
>>> >>>>> reliably breaks it.
>>> >>>>
>>> >>>> The commit hash in the master branch is
>>> >>>> cc365dcf0e56271bedf3de95f88922abe248e951 and is there since
>>> >>>> v4.16-rc1.
>>> >>>>
>>> >>>> Strange, that it is in 4.4 and 4.9, as it was only tagged for
>>> >>>> v4.13+.
>>> >>>>
>>> >>>>> 6. Applying the previously suggested patch
>>> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
>>> >>>>> has no effect; the E1000_STATUS_PCIM_STATE bit is not set when the
>>> >>>>> issue occurs.
>>> >>>>>
>>> >>>>> 7. Given the content of the change in #5, I assumed that the
>>> >>>>> problem was power-management related, perhaps a side effect of the
>>> >>>>> e1000e driver not being registered as a netdev. (So perhaps
>>> >>>>> something thinks that no devices are in use and turns something
>>> >>>>> off?)
>>> >>>>>
>>> >>>>> 8. I've previously posted register dumps from an e1000e in both
>>> >>>>> the "normal" and "link up but not transmitting" states. They
>>> >>>>> seemed very similar, but as I'm not familiar with the register
>>> >>>>> meanings I may have overlooked something significant. (Note that
>>> >>>>> the dumps were captured inside the watchdog task, when it detects
>>> >>>>> link up but before it sets E1000_TCTL_EN.)
>>> >>>>>
>>> >>>>> 9. I enabled debug logging in the mei driver; it logs a couple of
>>> >>>>> runtime_idles and then a runtime_suspend during system startup.
>>> >>>>> (I added a log to runtime_resume that is missing in the driver
>>> >>>>> source, but it appears this does not get called in my scenario.)
>>> >>>>> Note that the e1000e driver is still working ok after this.. at
>>> >>>>> least at first.
>>> >>>>>
>>> >>>>> 10. "cat /sys/devices/pci0000:00/0000:00:16.0/power/runtime_status"
>>> >>>>>     => "suspended"
>>> >>>>>     "cat /sys/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
>>> >>>>>     => "unsupported"
>>> >>>>>     "cat /sys/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
>>> >>>>>     => "active"
>>> >>>>>     "cat /sys/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
>>> >>>>>     => "active" (this is the actual NIC)
>>> >>>>>     These don't change between the working and non-working states.
>>> >>>>>     (It's possible that some other device does, but I haven't found
>>> >>>>>     it yet.)
>>> >>>>>
>>> >>>>> 11. I did try forcing the above to unsuspend, but this did not
>>> >>>>> recover from the e1000e issue.
>>> >>>>>
>>> >>>>> 12. I also tried calling e1000e_reset on link-down. This produces
>>> >>>>> different register output on link-up, but doesn't recover from the
>>> >>>>> issue.
>>> >>>>>
>>> >>>>> 13. I also tried recompiling the kernel with CONFIG_PM disabled
>>> >>>>> (no power management). This *does* resolve the problem (but is a
>>> >>>>> very big hammer).
>>> >>>>>
>>> >>>>> 14. Possibly also of interest is that if I do *both* #12 and #13,
>>> >>>>> the problem remains (suggesting #12 was counter-productive).
>>> >>>>>
>>> >>>>> FYI the hardware on one of the test machines is as follows:
>>> >>>>>     00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
>>> >>>>>     00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
>>> >>>>>     00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
>>> >>>>>     00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
>>> >>>>>     00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
>>> >>>>>     00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
>>> >>>>>     00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
>>> >>>>>     00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
>>> >>>>>     00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
>>> >>>>>     00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
>>> >>>>>     00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
>>> >>>>>     00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
>>> >>>>>     00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
>>> >>>>>     00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
>>> >>>>>     00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
>>> >>>>>     00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
>>> >>>>>     00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
>>> >>>>>     00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
>>> >>>>>     00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
>>> >>>>>     02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>> >>>>>     03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>> >>>>>     05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>> >
>>> > (Tomas, your MUA wrapped the lines messing up the formatting.)
>>
>>
>> Sorry, it's outlook.
>>
>>> >
>>> >>>>> I'm happy to add any code instrumentation or make any other
>>> >>>>> changes needed to locate and resolve the problem, and I can
>>> >>>>> readily reproduce it
>>> >>>>> -- I'm just at a complete loss as to where to start looking, and
>>> >>>>> am still hoping for some suggestions in that regard.
>>> >>>>>
>>> >>>>> If there's anywhere (or anyone) else better for me to talk to
>>> >>>>> about this issue, please let me know that too.
>>> >>>>
>>> >>>> It is not clear to me, if this is still reproducible on Linux
>>> >>>> 5.3-rc7 (or Linus' master branch).
>>> >>>>
>>> >>>> If it is, this is definitely a regression, and the commits need to
>>> >>>> be reverted due to Linux' no regression policy.
>>> >>>
>>> >>> So I should revert this from 4.4.y and 4.9.y?
>>> >>
>>> >> The issue is not in the mei driver, it is in the e1000 driver. To my
>>> >> best knowledge there should be a fix; Vitaly, can it please be
>>> >> backported to older kernels?
>>> >
>>> > Tomas, backporting the commit supposedly fixing this, does *not* help.
>>
>> I hope that Vitaly can address that.

As far as I can see it's not the same issue we had in the upstream 
driver when the mei commit was added.

Backporting this commit is not possible and will not help.

>>
>>> > Also, it does not matter for the no regression policy.
>>
>> There are power consumption implications if you revert this commit for
>> everyone, while the issue is present only on some platforms.
>
> I wouldn't suggest reverting that change, at least not solely on my
> account (unless it's affecting more people). It's not only me using
> this code but it's still a very niche case, and outside of "normal"
> Linux usage.
>
> Although it seems a little odd that it ended up in 4.4 and 4.9 when
> the commit said it was intended for 4.13+. But I don't know how those
> things work.
>
> (Though in a way this was good for me -- it would have been a lot 
> harder to run into this issue when switching from 4.9 to 4.19 [which 
> would have been the next step] rather than from 4.9.110 to 4.9.168 
> [which is what actually happened].)
>
>> You can still disable runtime power management via sysfs, and
>> permanently using a udev rule, on your particular system.
>> e.g. ATTR{../../power/control}="on"
>
> I'll do some more testing on this tomorrow, but I do recall trying 
> setting power/control to "on" (via sysfs) for the device:
>
>   00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
>
> which was the one that I noticed was suspended. Is this the mei device?
>
> In any case when I tried it before it didn't seem to help, but I think 
> this was after link-down and things had already failed. I'll try 
> testing a few more cases, including doing it pre-emptively.

I suggest testing this on kernel 5.3-rc7, as Paul advised, since the bug
wasn't reproduced on that kernel.
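
For the pre-emptive test, a minimal sketch along the following lines may help.
It is only an illustration: it assumes the standard /sys/bus/pci/devices
layout and that 0000:00:16.0 (the CSME HECI device in the lspci output above)
is the mei device. The names pm_status, pm_force_on and SYSFS_PCI are made up
for this example.

```shell
#!/bin/sh
# Sketch: inspect and pin runtime PM for a PCI device via sysfs.
# SYSFS_PCI is a hypothetical override so the helpers can be tried against a
# scratch directory; the default is the standard PCI device directory.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Print the runtime PM status of a device ("active", "suspended", ...).
pm_status() {
    cat "$SYSFS_PCI/$1/power/runtime_status"
}

# Keep the device powered: writing "on" to power/control disables runtime PM
# for that device, writing "auto" would re-enable it.
pm_force_on() {
    echo on > "$SYSFS_PCI/$1/power/control"
}

# Example (as root), run before unplugging the cable:
#   pm_force_on 0000:00:16.0   # assumed to be the mei device
#   pm_status   0000:00:16.0
#   pm_status   0000:00:1f.6   # the I219-LM NIC
```

Running pm_status on both devices before link-down and again after link-up
would show whether either one changes state across the failure.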

>
>>> > Let's wait until Gavin can confirm if it is happening with Linux
>>> > 5.3-rc7.
>>>
>>> As noted above (and in a prior email), the problem doesn't occur when
>>> using the driver "normally" within Linux. The triggering environment is
>>> where the driver init/send/receive code is being executed directly
>>> *without* being registered as a Linux netdev.
>>>
>>> It is likely that the "real problem" is some side effect of this, such
>>> as something checking if a child device is in use or powered down but
>>> it's not registered.
>>>
>>> My environment is currently based on this tree:
>>>
>>> > Using this kernel tree:
>>> >
>>> > https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/log/?h=v4.9-rt&ofs=3120
>>> >
>>> > I've identified that the code at tag v4.9.126 is "good" and the code
>>> > at tag v4.9.127 is "bad".
>>> (I then narrowed it down to that specific commit.)
>>>
>>> To reiterate, there is probably no problem with standard usage of the
>>> drivers as part of Linux.
>>>
>>> But in this particular non-standard-edge-case-usage, there seems to be
>>> some unfortunate interaction between the mei driver power management
>>> change and link-loss in onboard e1000e, and I'm trying to figure out the
>>> cause and hopefully a fix/workaround (or at least one less serious than
>>> disabling power management entirely).
>> This is some underlying issue; I don't think you will be able to
>> resolve it yourself, the e1000 guys should provide the fix.
>> Unfortunately I cannot really fix this issue from the mei side.
>>
>>>
>>> Some more context from my original email:
>>> > I'm using a system with an e1000e network driver which has been
>>> > patched to bypass the regular Linux network stack (because it can get
>>> > called from a Xenomai RT context, among other reasons -- although in
>>> > my case I'm not doing that). The complete source for the patched
>>> > version of the code can be found here:
>>> >
>>> > https://github.com/ribalda/ethercat/blob/master/devices/e1000e/netdev-4.9-ethercat.c
>>> > (There are some minor changes to other files, but the
>>> > majority of changes are only to this file. You can see just the
>>> > changes at
>>> > https://gist.github.com/uecasm/5e36a15bda6ffd53079344fc443dcc5f/revisions .)
>>> >
>>> > It was originally based on the in-kernel e1000e driver as of Linux
>>> > 4.9.65. (I'm not the person who originally made the patches, but I am
>>> > the person who rebased them to kernel 4.9 and I'm the one trying to
>>> > maintain them for newer kernel versions. Though I'm also not the
>>> > person who made that github repo.)
>>
>> You will eventually need to incorporate the e1000 fix, once resolved,
>> into your code base as well.
>> For now the easiest workaround is to disable power management on mei
>> from outside on affected platforms.
>
> Yeah, I'm hoping that the eventual solution will be a code change to
> the e1000e driver. The way the distribution is structured it's very
> easy to apply a fix there and much much harder to apply one at any
> other point. Though userspace rule changes are also feasible.

Please try our OOT driver which can be found in:

https://sourceforge.net/projects/e1000/files/e1000e%20stable/3.5.1/

Also, please open a ticket for this issue on that SourceForge page.

