Subject: Help understanding a possible timing issue in PCIe link training?
From: Kyle Auble @ 2014-07-14  5:41 UTC
To: linux-pci

Hello, I wanted to keep this email short, but my questions are all
interconnected. My GPU is an on-board Nvidia GeForce 8400M GT
(PCI ID [10de:0426]), and since at least kernel v3.2, the generic x86
kernel only brings the device up successfully about 1 boot in 10. This
is still true as of v3.16-rc3. Honestly, it's probably something the
BIOS should handle, but I've checked, and there are no relevant options
or upgrades for my BIOS (on a Sony Vaio VGN-FZ260E).

I've been tracking this problem at launchpad.net on-and-off for a
couple years now, but I don't think it's a common issue, and I have
some free time to try resolving it myself now. I'm new to systems
programming, though, so I was wondering: does the issue I'm seeing fit a
pattern of some kind? Can someone help me understand how the symptoms
fit together and where they come from? Or if I need to do more
analysis, what would probably be the best approach?

1. The key thing I discovered is that whenever the GPU does load, a
~6ms gap appears in the dmesg log during the GPU's PCI initialization.
When the GPU fails to load, though, the gap grows to 30ms. I've also
pinpointed the delay (with dev_info statements) to
pcie_aspm_configure_common_clock in drivers/pci/pcie/aspm.c.

After some googling, I came across slide decks from PCI-SIG that give
24ms as the PCIe-specified timeout for certain link-training states,
and sure enough, this function tells the bridge upstream of the GPU to
retrain the link. However, even when the GPU fails to load and 30ms is
spent in the function, the dev_err towards the end of the function
doesn't print.
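
For reference, here's my paraphrase of the tail end of that function
from the v3.16 source (trimmed to just the retrain-and-wait logic and
not a verbatim copy; the real function first syncs the Common Clock
Configuration bit across the bridge and all of its child functions):

/*
 * Paraphrased from pcie_aspm_configure_common_clock() in
 * drivers/pci/pcie/aspm.c (v3.16-era); details may differ.
 */
#include <linux/delay.h>
#include <linux/jiffies.h>
#include <linux/pci.h>

#define LINK_RETRAIN_TIMEOUT HZ         /* a full second */

static void retrain_and_wait(struct pci_dev *parent)
{
        unsigned long start_jiffies;
        u16 reg16;

        /* Tell the upstream bridge to retrain the link */
        pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16);
        reg16 |= PCI_EXP_LNKCTL_RL;
        pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16);

        /* Poll Link Status until the Link Training bit clears or we time out */
        start_jiffies = jiffies;
        for (;;) {
                pcie_capability_read_word(parent, PCI_EXP_LNKSTA, &reg16);
                if (!(reg16 & PCI_EXP_LNKSTA_LT))
                        break;
                if (time_after(jiffies, start_jiffies + LINK_RETRAIN_TIMEOUT))
                        break;
                msleep(1);
        }

        /* Only a timeout (training bit still set) reaches the error path */
        if (reg16 & PCI_EXP_LNKSTA_LT)
                dev_err(&parent->dev, "ASPM: Could not configure common clock\n");
}

If I'm reading that right, the timeout is a whole second, so spending
only 30ms in the function means the Link Training bit did eventually
clear, which would explain why the dev_err never prints on bad boots.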

2. Now, the first reason I'm fairly certain this isn't an
unrecoverable hardware problem is that there's a workaround. If I make
sure my computer is running off the battery, with AC power unplugged,
for that first second of kernel initialization, the GPU always loads.
I've tried this dozens of times. I don't clearly understand why, but
I've read that the power-saving link states do correspond to distinct
states in the link-training state machine.
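
If it would help test that hypothesis, I was thinking of adding
something like the following (hypothetical and untested; the helper
name is mine) just before the retrain, to compare what the bridge's
ASPM bits look like on an AC boot versus a battery boot:

/* Hypothetical debug helper (untested): dump the upstream bridge's ASPM
 * control bits and Link Training state for comparison across boots. */
#include <linux/pci.h>

static void dump_aspm_state(struct pci_dev *parent)
{
        u16 lnkctl, lnksta;

        pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &lnkctl);
        pcie_capability_read_word(parent, PCI_EXP_LNKSTA, &lnksta);

        dev_info(&parent->dev, "LNKCTL=%#06x (ASPM=%#x) LNKSTA=%#06x (LT=%u)\n",
                 lnkctl, lnkctl & PCI_EXP_LNKCTL_ASPMC,
                 lnksta, !!(lnksta & PCI_EXP_LNKSTA_LT));
}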

3. The next fact (one I have no explanation for) is that the situation
reverses almost exactly on the amd64 kernel. The 64-bit kernel brings
the GPU up fine 9 times out of 10, but there is still the occasional
boot where the 30ms gap appears and the GPU never loads.

4. To keep things simple, I also tried inserting dev_info statements
within the different branches of pcie_aspm_configure_common_clock, but
this made the problem disappear (leaving only a 6ms gap). I tried once
more with fewer statements to reduce the overhead, which did increase
the gap to 11ms but still allowed the GPU to load. It makes sense to me
that extra overhead in the function affects the timing, but it's
counter-intuitive that it would decrease the time spent there.
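
One idea I haven't tried yet: instead of sprinkling dev_info calls
(each of which costs time itself), sample the clock at entry and exit
and print a single line at the end, roughly like this sketch (untested):

/* Sketch (untested): one ktime sample at entry, one at exit, and a
 * single print afterwards, so the measurement itself barely perturbs
 * the code path being measured. */
#include <linux/ktime.h>
#include <linux/pci.h>

static void time_common_clock_config(struct pci_dev *parent)
{
        ktime_t start = ktime_get();

        /* ... existing body of pcie_aspm_configure_common_clock() ... */

        dev_info(&parent->dev, "common clock config took %lld us\n",
                 ktime_to_us(ktime_sub(ktime_get(), start)));
}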

5. Finally, before I started looking through the code, I tried some
git bisections, because there was a brief period in the summer of 2013
when the problem went away. The commit that resolved it turned out to
be d34883d4e35c0a994e91dd847a82b4c9e0c31d83 by Xiao Guangrong. After
the problem returned, I tried another bisection, but wound up bisecting
by hand instead of using git bisect (I honestly don't remember why).
The commit I found that reintroduced the problem was
ee8209fd026b074bb8eb75bece516a338a281b1b by Andy Shevchenko.

What stumps me is that neither of these commits appears directly
related to the PCI subsystem. Because it wasn't a normal bisection that
flagged Andy's commit, and I didn't test that build as thoroughly, I
still wonder if it's a false positive. However, I've tested a kernel
built at Xiao's commit many times, so I'm confident it resolved the
issue, though my hypothesis is that it did so purely through a subtle
side effect of how the machine code happens to be loaded into memory
at startup.

Again, I apologize for the length, but I'd be grateful for any advice.
I'm not subscribed to the mailing list, so I'd appreciate being CC'ed
on any replies. I don't plan on becoming a regular kernel hacker
anytime soon; I just want to do my tiny part to help.

Sincerely,
Kyle Auble



Subject: Re: Help understanding a possible timing issue in PCIe link training?
From: Bjorn Helgaas @ 2014-07-15 18:36 UTC
To: Kyle Auble
Cc: linux-pci@vger.kernel.org

On Sun, Jul 13, 2014 at 11:41 PM, Kyle Auble <kyle.auble@zoho.com> wrote:
> Hello, I wanted to keep this email short, but my questions are all
> interconnected. My GPU is an on-board Nvidia GeForce 8400M GT
> (PCI ID [10de:0426]), and since at least kernel v3.2, the generic x86
> kernel only brings the device up successfully about 1 boot in 10. This
> is still true as of v3.16-rc3. Honestly, it's probably something the
> BIOS should handle, but I've checked, and there are no relevant options
> or upgrades for my BIOS (on a Sony Vaio VGN-FZ260E).
>
> I've been tracking this problem at launchpad.net on-and-off for a
> couple years now,

This is https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1009312, right?

Please collect the following information from two boots of the newest
kernel you have: the first on battery power where the GPU works fine,
and the second on AC power where the GPU driver fails to load:

  - complete dmesg log
  - "lspci -vvxxx" output for the whole system

In addition, please collect an acpidump (this will be the same either
way, so you only need one copy).

I see that in https://launchpadlibrarian.net/106978987/baddmesg.log,
you were using the proprietary nvidia driver.  If the problem is that
that driver isn't loading, there isn't much we can do because it's
closed-source.  But if you're seeing a problem before loading that
driver, there might be something we can fix.

Bjorn


