Re: [bisected] tg3 broken in 3.18.0?

* Re: [bisected] tg3 broken in 3.18.0?
@ 2014-12-13 21:02 Nils Holland
  2014-12-15 15:06 ` Marcelo Ricardo Leitner
  2014-12-16  0:31 ` Bjorn Helgaas
  0 siblings, 2 replies; 24+ messages in thread
From: Nils Holland @ 2014-12-13 21:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-pci, rajatxjain

On Fri, Dec 12, 2014 at 08:18:31PM -0500, David Miller wrote:
> From: Nils Holland <nholland@tisys.org>
> Date: Sat, 13 Dec 2014 02:14:08 +0100
> 
> > 
> > My bisect exercise suggests that the following commit is the culprit:
> > 
> > 89665a6a71408796565bfd29cfa6a7877b17a667 (PCI: Check only the Vendor
> > ID to identify Configuration Request Retry)
> 
> You definitely need to bring this up with the author of that change
> and the relevant list for the PCI subsystem and/or linux-kernel.

I've now already sent an inquiry to Rajat Jain, the author of the
patch in question, and this message here is now also CC'd to
linux-pci@.

With this message, I'd like to add one last result of investigation
I've done today, in the hope that it will aid the folks with more
knowledge to go after the issue.

Basically, I've added a little debug output to tg3.c in the function
tg3_poll_fw(), as that function contained the code that would print
out the "No firmware running" line that was visible in dmesg on those
kernels where tg3 would not work for me. So, I basically had this:

static int tg3_poll_fw(struct tg3 *tp)
{
        int i;
        u32 val;

        netdev_info(tp->dev, "XX: Boom!\n");
        [...]
}

Now, I was looking through dmesg searching for occurrences of this
debug output, using a standard 3.18.0 kernel (where my tg3 doesn't
work) as well as using a 3.18.0 kernel with
89665a6a71408796565bfd29cfa6a7877b17a667 reverted (where my tg3
works). Here are the results:

[standard 3.18.0 (=problematic)]:
[    2.197653] libphy: tg3 mdio bus: probed
[    2.257488] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev 57780001] (PCI Express) MAC address 00:19:99:ce:13:a6
[    2.259589] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom BCM57780] (mii_bus:phy_addr=200:01)
[    2.261740] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[    2.263912] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[...]
[   10.028002] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
[   10.028247] tg3 0000:02:00.0 enp2s0: XX: Boom!
[   12.157034] tg3 0000:02:00.0 enp2s0: No firmware running

[3.18.0 without above mentioned patch, 3.17.3 is the same, both result
in a working tg3]:
[    1.397167] libphy: tg3 mdio bus: probed
[    1.456473] tg3 0000:02:00.0 (unnamed net_device) (uninitialized): XX: Boom!
[    1.464987] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev 57780001] (PCI Express) MAC address 00:19:99:ce:13:a6
[    1.467118] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom BCM57780] (mii_bus:phy_addr=200:01)
[    1.469311] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[    1.471500] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[...]
[    9.631629] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
[    9.631962] tg3 0000:02:00.0 enp2s0: XX: Boom!
[    9.634339] tg3 0000:02:00.0 enp2s0: XX: Boom!
[    9.642741] IPv6: ADDRCONF(NETDEV_UP): enp2s0: link is not ready
[   10.479636] tg3 0000:02:00.0 enp2s0: Link is down
[   11.484498] tg3 0000:02:00.0 enp2s0: Link is up at 100 Mbps, full duplex

As can be seen, there are two tg3-related sections in my dmesg in both
the working and non-working scenarios: At about 1 - 2 secs, the card
seems to begin initializing, and at about 9 - 10 seconds it is (or
should be) ready to establish a network connection.

My debug section, or tg3.c's tg3_poll_fw(), seems to be called thrice
in the working situation: The first hit occurs at 1.456473 where the tg3
device is still reported as "(unnamed net_device) (uninitialized)".
Then, the section gets hit twice again at around 9.63 - at this point
the driver already reports the card as initialized / by its real name.

In the non-working situation, the debug section seems to be hit only
once, at 10.028247. At this point, the tg3 is already reported as
initialized - just like when it's hit the second and third time in the
working situation.

Bottom line is that commit 89665a6a71408796565bfd29cfa6a7877b17a667
really makes a difference regarding the way the tg3 card is
initialized, which seems to cause the problem.

Greetings,
Nils

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-13 21:02 [bisected] tg3 broken in 3.18.0? Nils Holland
@ 2014-12-15 15:06 ` Marcelo Ricardo Leitner
  2014-12-16 16:04   ` Rajat Jain
  2014-12-16  0:31 ` Bjorn Helgaas
  1 sibling, 1 reply; 24+ messages in thread
From: Marcelo Ricardo Leitner @ 2014-12-15 15:06 UTC (permalink / raw)
  To: Nils Holland, David Miller; +Cc: netdev, linux-pci, rajatxjain

On 13-12-2014 19:02, Nils Holland wrote:
> On Fri, Dec 12, 2014 at 08:18:31PM -0500, David Miller wrote:
>> From: Nils Holland <nholland@tisys.org>
>> Date: Sat, 13 Dec 2014 02:14:08 +0100
>>
>>>
>>> My bisect exercise suggests that the following commit is the culprit:
>>>
>>> 89665a6a71408796565bfd29cfa6a7877b17a667 (PCI: Check only the Vendor
>>> ID to identify Configuration Request Retry)
>>
>> You definitely need to bring this up with the author of that change
>> and the relevant list for the PCI subsystem and/or linux-kernel.
>
> I've now already sent an inquiry to Rajat Jain, the author of the
> patch in question, and this message here is now also CC'd to
> linux-pci@.
>
> With this message, I'd like to add one last result of investigation
> I've done today, in the hope that it will aid the folks with more
> knowledge to go after the issue.

FWIW, reverting this change fixes tg3 in here too.

Thanks Nils for doing the bisect!

With these debugs (note the re-revert):

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 2306268..4474502 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1436,14 +1436,22 @@ bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
                 return false;

         /* Configuration request Retry Status */
-       while (*l == 0xffff0001) {
-               if (!crs_timeout)
+       printk ("pci %04x:%02x:%02x.%d: 1st %x %x\n", pci_domain_nr(bus), bus->number,
+               PCI_SLOT(devfn), PCI_FUNC(devfn), *l, *l & 0xffff);
+       while ((*l & 0xffff) == 0x0001) {
+               if (!crs_timeout) {
+                       printk ("pci %04x:%02x:%02x.%d: crs_timeout: %d\n", pci_domain_nr(bus),
+                               bus->number, PCI_SLOT(devfn), PCI_FUNC(devfn), crs_timeout);
                         return false;
+               }

                 msleep(delay);
                 delay *= 2;
-               if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
+               if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l)) {
+                       printk ("pci %04x:%02x:%02x.%d: pci_bus_read_config_dword failed\n", pci_domain_nr(bus),
+                               bus->number, PCI_SLOT(devfn), PCI_FUNC(devfn));
                         return false;
+               }
                 /* Card hasn't responded in 60 seconds?  Must be stuck. */
                 if (delay > crs_timeout) {
                         printk(KERN_WARNING "pci %04x:%02x:%02x.%d: not responding\n",
@@ -1451,6 +1459,7 @@ bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
                                PCI_FUNC(devfn));
                         return false;
                 }
+               printk ("pci %04x:%02x:%02x.%d: %x %x\n", pci_domain_nr(bus), bus->number, PCI_SLOT(devfn), PCI_FUNC(devfn), *l, *l & 0xffff);
         }

         return true;

I'm getting, with commit 89665a6a71408796565bfd29cfa6a7877b17a667:

$ grep 'pci 0000:02' tg3.bad
[    0.190733] pci 0000:02:00.0: 1st 165a14e4 14e4
[    0.190736] pci 0000:02:00.0: 1st 165a14e4 14e4
[    0.190810] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
[    0.190885] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
[    0.191048] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
[    0.191382] pci 0000:02:00.0: PME# supported from D3hot D3cold
[    0.191438] pci 0000:02:00.0: System wakeup disabled by ACPI
[    1.561555] pci 0000:02:00.0: 1st 1 1
[    1.561558] pci 0000:02:00.0: crs_timeout: 0
[   20.412021] pci 0000:02:00.0: 1st 1 1
[   20.412022] pci 0000:02:00.0: crs_timeout: 0
[   20.413596] pci 0000:02:00.0: 1st 1 1
[   20.413598] pci 0000:02:00.0: crs_timeout: 0

And without it:

$ grep 'pci 0000:02' tg3.good
[    0.190734] pci 0000:02:00.0: 1st 165a14e4 14e4
[    0.190738] pci 0000:02:00.0: 1st 165a14e4 14e4
[    0.190811] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
[    0.190884] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
[    0.191047] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
[    0.191380] pci 0000:02:00.0: PME# supported from D3hot D3cold
[    0.191439] pci 0000:02:00.0: System wakeup disabled by ACPI
[    1.576778] pci 0000:02:00.0: 1st 1 1
[   19.068517] pci 0000:02:00.0: 1st 165a14e4 14e4

Hope that helps!

   Marcelo

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-15 15:06 ` Marcelo Ricardo Leitner
@ 2014-12-16 16:04   ` Rajat Jain
  2014-12-16 16:20     ` Bjorn Helgaas
  2014-12-16 18:00     ` Marcelo Ricardo Leitner
  0 siblings, 2 replies; 24+ messages in thread
From: Rajat Jain @ 2014-12-16 16:04 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Nils Holland, David Miller, netdev, linux-pci@vger.kernel.org

Hello All,

Apologies for jumping in late, but for some reason I do not see the
original mail in my inbox. However I am taking a look at the mails as
sent on linux-pci (and I will keep an eye out for the bug report that
Bjorn asked for).

>
> I'm getting, with commit 89665a6a71408796565bfd29cfa6a7877b17a667:
>
> $ grep 'pci 0000:02' tg3.bad
> [    0.190733] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190736] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190810] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
> [    0.190885] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
> [    0.191048] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
> [    0.191382] pci 0000:02:00.0: PME# supported from D3hot D3cold
> [    0.191438] pci 0000:02:00.0: System wakeup disabled by ACPI
> [    1.561555] pci 0000:02:00.0: 1st 1 1
> [    1.561558] pci 0000:02:00.0: crs_timeout: 0
> [   20.412021] pci 0000:02:00.0: 1st 1 1
> [   20.412022] pci 0000:02:00.0: crs_timeout: 0
> [   20.413596] pci 0000:02:00.0: 1st 1 1
> [   20.413598] pci 0000:02:00.0: crs_timeout: 0
>
> And without it:
>
> $ grep 'pci 0000:02' tg3.good
> [    0.190734] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190738] pci 0000:02:00.0: 1st 165a14e4 14e4
> [    0.190811] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
> [    0.190884] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
> [    0.191047] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
> [    0.191380] pci 0000:02:00.0: PME# supported from D3hot D3cold
> [    0.191439] pci 0000:02:00.0: System wakeup disabled by ACPI
> [    1.576778] pci 0000:02:00.0: 1st 1 1
> [   19.068517] pci 0000:02:00.0: 1st 165a14e4 14e4
>

It seems that the first 2 attempts that were made to probe the
device are all OK and return the regular device ID and vendor ID for TG3
(CRS does not have a role to play). However, later attempts return a
CRS.

1) May I ask if you are using acpihp or pciehp? I assume pciehp?

2) Can you please also send dmesg output while passing
pciehp.pciehp_debug=1? In the fail case, do you see a message
indicating the pciehp gave up since it got CRS for a long time
(something like "pci 0000:02:00.0 id reading try 50 times with
interval 20 ms to get ffff0001")?

3) Currently the pciehp passes "0" for the argument "crs_timeout" to
pci_bus_read_dev_vendor_id(). Can you please try increasing it to, say,
30 seconds (30 * 1000) and run the fail case once again? (For
comparison, acpihp uses the value 60 * 1000, i.e. 60 seconds, today.)

Thanks a lot in advance for the debugging help ;-)

Rajat

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-16 16:04   ` Rajat Jain
@ 2014-12-16 16:20     ` Bjorn Helgaas
  2014-12-16 17:15       ` Michael Chan
  2014-12-16 18:00     ` Marcelo Ricardo Leitner
  1 sibling, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2014-12-16 16:20 UTC (permalink / raw)
  To: Rajat Jain
  Cc: Marcelo Ricardo Leitner, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki, Prashant Sreedharan,
	Michael Chan

[+cc Rafael, Prashant, Michael]

On Tue, Dec 16, 2014 at 9:04 AM, Rajat Jain <rajatxjain@gmail.com> wrote:
> Hello All,
>
> Apologies for jumping in late, but for some reason I do not see the
> original mail in my inbox. However I am taking a look at the mails as
> sent on linux-pci (and I will keep an eye out for the bug report that
> Bjorn asked for).
>
>
>>
>> I'm getting, with commit 89665a6a71408796565bfd29cfa6a7877b17a667:
>>
>> $ grep 'pci 0000:02' tg3.bad
>> [    0.190733] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190736] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190810] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
>> [    0.190885] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
>> [    0.191048] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
>> [    0.191382] pci 0000:02:00.0: PME# supported from D3hot D3cold
>> [    0.191438] pci 0000:02:00.0: System wakeup disabled by ACPI
>> [    1.561555] pci 0000:02:00.0: 1st 1 1
>> [    1.561558] pci 0000:02:00.0: crs_timeout: 0
>> [   20.412021] pci 0000:02:00.0: 1st 1 1
>> [   20.412022] pci 0000:02:00.0: crs_timeout: 0
>> [   20.413596] pci 0000:02:00.0: 1st 1 1
>> [   20.413598] pci 0000:02:00.0: crs_timeout: 0
>>
>> And without it:
>>
>> $ grep 'pci 0000:02' tg3.good
>> [    0.190734] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190738] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190811] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
>> [    0.190884] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
>> [    0.191047] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
>> [    0.191380] pci 0000:02:00.0: PME# supported from D3hot D3cold
>> [    0.191439] pci 0000:02:00.0: System wakeup disabled by ACPI
>> [    1.576778] pci 0000:02:00.0: 1st 1 1
>> [   19.068517] pci 0000:02:00.0: 1st 165a14e4 14e4
>>
>
> It seems that in the first 2 attempts that were made to probe the
> device are all OK and return regular device ID and vendor ID for TG3
> (CRS does not have a role to play). However, later attempts return a
> CRS.
>
> 1) May I ask if you are using acpihp or pciehp? I assume pciehp?
>
> 2) Can you please also send dmesg output while passing
> pciehp.pciehp_debug=1? In the fail case, do you see a message
> indicating the pciehp gave up since it got CRS for a long time
> (something like "pci 0000:02:00.0 id reading try 50 times with
> interval 20 ms to get ffff0001")?
>
> 3) Currently the pciehp passes "0" for the argument "crs_timeout" to
> pci_bus_read_dev_vendor_id(). Can you please try increasing it to, say
> 30 seconds (30 * 1000). (For comparison data, acpihp uses the value
> 60*1000 i.e. 60 seconds today) and run the fail case once again?

Using zero for the timeout seems bogus to me.  But I doubt pciehp is
involved in this situation.

I think we're in this path:

    tg3_init_hw
      tg3_reset_hw
        tg3_disable_ints
        tg3_stop_fw
        tg3_write_sig_pre_reset
        tg3_chip_reset
          pci_device_is_present
            pci_bus_read_dev_vendor_id

and in this case pci_device_is_present() also passes a timeout of zero
to pci_bus_read_dev_vendor_id().  My guess is that tg3 is resetting
the device, so it's not too surprising that the config read returns
CRS status immediately afterward.

Bjorn

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-16 16:20     ` Bjorn Helgaas
@ 2014-12-16 17:15       ` Michael Chan
  2014-12-16 17:59         ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Chan @ 2014-12-16 17:15 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rajat Jain, Marcelo Ricardo Leitner, Nils Holland, David Miller,
	netdev, linux-pci@vger.kernel.org, Rafael Wysocki,
	Prashant Sreedharan

On Tue, 2014-12-16 at 09:20 -0700, Bjorn Helgaas wrote:
> I think we're in this path:
> 
>     tg3_init_hw
>       tg3_reset_hw
>         tg3_disable_ints
>         tg3_stop_fw
>         tg3_write_sig_pre_reset
>         tg3_chip_reset
>           pci_device_is_present
>             pci_bus_read_dev_vendor_id
> 
> and in this case pci_device_is_present() also passes a timeout of zero
> to pci_bus_read_dev_vendor_id().  My guess is that tg3 is resetting
> the device, so it's not too surprising that the config read returns
> CRS status immediately afterward.
> 
At the point of calling pci_device_is_present(), chip reset hasn't
started yet, so there should be no problem reading config space.

In all the newer tg3 chips, chip reset does not reset the PCIE block.
So I think config space should always be accessible even during reset.

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-16 17:15       ` Michael Chan
@ 2014-12-16 17:59         ` Marcelo Ricardo Leitner
  2014-12-16 19:54           ` Michael Chan
  0 siblings, 1 reply; 24+ messages in thread
From: Marcelo Ricardo Leitner @ 2014-12-16 17:59 UTC (permalink / raw)
  To: Michael Chan, Bjorn Helgaas
  Cc: Rajat Jain, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki, Prashant Sreedharan

On 16-12-2014 15:15, Michael Chan wrote:
> On Tue, 2014-12-16 at 09:20 -0700, Bjorn Helgaas wrote:
>> I think we're in this path:
>>
>>      tg3_init_hw
>>        tg3_reset_hw
>>          tg3_disable_ints
>>          tg3_stop_fw
>>          tg3_write_sig_pre_reset
>>          tg3_chip_reset
>>            pci_device_is_present
>>              pci_bus_read_dev_vendor_id
>>
>> and in this case pci_device_is_present() also passes a timeout of zero
>> to pci_bus_read_dev_vendor_id().  My guess is that tg3 is resetting
>> the device, so it's not too surprising that the config read returns
>> CRS status immediately afterward.
>>
> At the point of calling pci_device_is_present(), chip reset hasn't
> started yet, so there should be no problem reading config space.
> 
> In all the newer tg3 chips, chip reset does not reset the PCIE block.
> So I think config space should always be accessible even during reset.

It's a 
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express
over here

I put a WARN_ON(1) after those printks, and this is what I got:

[    1.550640] pci 0000:02:00.0: 1st 1 1
[    1.550643] pci 0000:02:00.0: crs_timeout: 0
[    1.550645] ------------[ cut here ]------------
[    1.550651] WARNING: CPU: 6 PID: 364 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
[    1.550652] Modules linked in: i915(+) raid0 i2c_algo_bit drm_kms_helper drm e1000e(+) tg3(+) ptp pps_core video
[    1.550660] CPU: 6 PID: 364 Comm: systemd-udevd Not tainted 3.18.0-rc6+ #8
[    1.550661] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
[    1.550662]  0000000000000000 000000004de2d8dc ffff8807eabdf948 ffffffff8173db46
[    1.550665]  0000000000000000 0000000000000000 ffff8807eabdf988 ffffffff81094d41
[    1.550667]  ffff8807eabdf968 ffff8807f1e27000 0000000000000000 0000000000000000
[    1.550669] Call Trace:
[    1.550675]  [<ffffffff8173db46>] dump_stack+0x46/0x58
[    1.550679]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
[    1.550681]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
[    1.550683]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
[    1.550687]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
[    1.550693]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
[    1.550697]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
[    1.550701]  [<ffffffffa0044f83>] tg3_init_one+0xb83/0x1a40 [tg3]
[    1.550705]  [<ffffffff8127d2bb>] ? kernfs_activate+0x7b/0xf0
[    1.550708]  [<ffffffff813bbdc5>] local_pci_probe+0x45/0xa0
[    1.550711]  [<ffffffff8127fa8d>] ? sysfs_do_create_link_sd.isra.2+0x6d/0xc0
[    1.550714]  [<ffffffff813bd1b9>] pci_device_probe+0xf9/0x150
[    1.550717]  [<ffffffff814906fd>] driver_probe_device+0x12d/0x3d0
[    1.550720]  [<ffffffff81490a7b>] __driver_attach+0x9b/0xa0
[    1.550722]  [<ffffffff814909e0>] ? __device_attach+0x40/0x40
[    1.550724]  [<ffffffff8148e4f3>] bus_for_each_dev+0x73/0xc0
[    1.550726]  [<ffffffff814900ee>] driver_attach+0x1e/0x20
[    1.550729]  [<ffffffff8148fcb0>] bus_add_driver+0x180/0x250
[    1.550731]  [<ffffffffa0050000>] ? 0xffffffffa0050000
[    1.550733]  [<ffffffff81491274>] driver_register+0x64/0xf0
[    1.550735]  [<ffffffff813bb72b>] __pci_register_driver+0x4b/0x50
[    1.550739]  [<ffffffffa005001e>] tg3_driver_init+0x1e/0x1000 [tg3]
[    1.550742]  [<ffffffff81002144>] do_one_initcall+0xd4/0x210
[    1.550747]  [<ffffffff811cbc42>] ? __vunmap+0xc2/0x110
[    1.550751]  [<ffffffff8111336b>] load_module+0x1cab/0x2730
[    1.550753]  [<ffffffff8110efc0>] ? store_uevent+0x70/0x70
[    1.550756]  [<ffffffff8120b090>] ? kernel_read+0x50/0x80
[    1.550760]  [<ffffffff81113fa6>] SyS_finit_module+0xa6/0xd0
[    1.550763]  [<ffffffff81745129>] system_call_fastpath+0x12/0x17
[    1.550764] ---[ end trace 4cc3153e369484ea ]---
[    1.550963] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM95722) rev a200] (PCI Express) MAC address 00:0a:f7:2b:9b:39
[    1.550965] tg3 0000:02:00.0 eth0: attached PHY is 5722/5756 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
[    1.550966] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[    1.550967] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[    1.556112] tg3 0000:02:00.0 p1p1: renamed from eth0
...

[   23.545119] tg3 0000:02:00.0: irq 32 for MSI/MSI-X
[   25.424981] tg3 0000:02:00.0 p1p1: No firmware running
[   25.425686] pci 0000:02:00.0: 1st 1 1
[   25.425687] pci 0000:02:00.0: crs_timeout: 0
[   25.425687] ------------[ cut here ]------------
[   25.425691] WARNING: CPU: 0 PID: 1590 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
[   25.425692] Modules linked in: bridge stp llc openvswitch x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul snd_hwdep crc32_pclmul crc32c_intel ghash_clmulni_intel snd_seq mei_me snd_seq_device snd_pcm iTCO_wdt iTCO_vendor_support snd_timer mei snd lpc_ich i2c_i801 pcspkr mfd_core dcdbas serio_raw soundcore microcode ie31200_edac shpchp edac_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc xfs libcrc32c i915 raid0 i2c_algo_bit drm_kms_helper drm e1000e tg3 ptp pps_core video
[   25.425714] CPU: 0 PID: 1590 Comm: ip Tainted: G        W      3.18.0-rc6+ #8
[   25.425715] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
[   25.425716]  0000000000000000 0000000097b01d0c ffff8807f0687408 ffffffff8173db46
[   25.425717]  0000000000000000 0000000000000000 ffff8807f0687448 ffffffff81094d41
[   25.425719]  ffff8807f0687428 ffff8807f1e27000 0000000000000000 0000000000000000
[   25.425720] Call Trace:
[   25.425723]  [<ffffffff8173db46>] dump_stack+0x46/0x58
[   25.425726]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
[   25.425728]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
[   25.425729]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
[   25.425733]  [<ffffffffa0028b47>] ? tg3_phy_auxctl_write+0x27/0x30 [tg3]
[   25.425735]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
[   25.425738]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
[   25.425740]  [<ffffffffa003756d>] tg3_reset_hw+0x8d/0x2ce0 [tg3]
[   25.425743]  [<ffffffff81383f6a>] ? delay_tsc+0x4a/0x80
[   25.425744]  [<ffffffff81383eec>] ? __udelay+0x2c/0x30
[   25.425747]  [<ffffffffa0026be4>] ? _tw32_flush+0x44/0x80 [tg3]
[   25.425749]  [<ffffffffa003a216>] tg3_init_hw+0x56/0x60 [tg3]
[   25.425751]  [<ffffffffa003c0d5>] tg3_start+0xbe5/0x1210 [tg3]
[   25.425753]  [<ffffffff81383eec>] ? __udelay+0x2c/0x30
[   25.425755]  [<ffffffffa0026be4>] ? _tw32_flush+0x44/0x80 [tg3]
[   25.425757]  [<ffffffffa003c828>] tg3_open+0x128/0x2e0 [tg3]
[   25.425760]  [<ffffffff8162c6cf>] __dev_open+0xcf/0x140
[   25.425761]  [<ffffffff8162c9f1>] __dev_change_flags+0xa1/0x160
[   25.425762]  [<ffffffff8162cad9>] dev_change_flags+0x29/0x60
[   25.425764]  [<ffffffff8163a4a9>] do_setlink+0x399/0xa90
[   25.425766]  [<ffffffff8163ca7c>] rtnl_newlink+0x51c/0x740
[   25.425768]  [<ffffffff8163c653>] ? rtnl_newlink+0xf3/0x740
[   25.425771]  [<ffffffff811e730c>] ? new_slab+0x14c/0x490
[   25.425774]  [<ffffffff81303188>] ? security_capable+0x18/0x20
[   25.425776]  [<ffffffff8109cf7d>] ? ns_capable+0x2d/0x60
[   25.425778]  [<ffffffff816391a4>] rtnetlink_rcv_msg+0xa4/0x270
[   25.425780]  [<ffffffff8165840d>] ? __netlink_lookup+0x4d/0x70
[   25.425781]  [<ffffffff81639100>] ? rtnetlink_rcv+0x40/0x40
[   25.425783]  [<ffffffff8165c4a1>] netlink_rcv_skb+0xc1/0xe0
[   25.425784]  [<ffffffff816390ec>] rtnetlink_rcv+0x2c/0x40
[   25.425785]  [<ffffffff8165ba26>] netlink_unicast+0x106/0x210
[   25.425787]  [<ffffffff8165be55>] netlink_sendmsg+0x325/0x790
[   25.425788]  [<ffffffff8160de50>] sock_sendmsg+0xa0/0xe0
[   25.425791]  [<ffffffff8120e8cd>] ? lookup_real+0x1d/0x50
[   25.425792]  [<ffffffff8160e394>] ___sys_sendmsg+0x2f4/0x310
[   25.425794]  [<ffffffff8119bdf2>] ? lru_cache_add_active_or_unevictable+0x32/0xc0
[   25.425796]  [<ffffffff8160c673>] ? sock_destroy_inode+0x33/0x40
[   25.425798]  [<ffffffff8121bfd5>] ? __dentry_kill+0x145/0x1d0
[   25.425799]  [<ffffffff8121c105>] ? dput+0xa5/0x170
[   25.425800]  [<ffffffff81224f74>] ? mntput+0x24/0x40
[   25.425802]  [<ffffffff81206d6a>] ? __fput+0x17a/0x1e0
[   25.425803]  [<ffffffff8160ee21>] __sys_sendmsg+0x51/0x90
[   25.425805]  [<ffffffff8160ee72>] SyS_sendmsg+0x12/0x20
[   25.425807]  [<ffffffff81745129>] system_call_fastpath+0x12/0x17
[   25.425808] ---[ end trace 4cc3153e369484eb ]---
[   25.427385] pci 0000:02:00.0: 1st 1 1
[   25.427386] pci 0000:02:00.0: crs_timeout: 0
[   25.427387] ------------[ cut here ]------------
[   25.427389] WARNING: CPU: 0 PID: 1590 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
[   25.427389] Modules linked in: bridge stp llc openvswitch x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec crct10dif_pclmul snd_hwdep crc32_pclmul crc32c_intel ghash_clmulni_intel snd_seq mei_me snd_seq_device snd_pcm iTCO_wdt iTCO_vendor_support snd_timer mei snd lpc_ich i2c_i801 pcspkr mfd_core dcdbas serio_raw soundcore microcode ie31200_edac shpchp edac_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc xfs libcrc32c i915 raid0 i2c_algo_bit drm_kms_helper drm e1000e tg3 ptp pps_core video
[   25.427403] CPU: 0 PID: 1590 Comm: ip Tainted: G        W      3.18.0-rc6+ #8
[   25.427404] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
[   25.427405]  0000000000000000 0000000097b01d0c ffff8807f0687488 ffffffff8173db46
[   25.427406]  0000000000000000 0000000000000000 ffff8807f06874c8 ffffffff81094d41
[   25.427416]  ffff8807f06874a8 ffff8807f1e27000 0000000000000000 0000000000000000
[   25.427417] Call Trace:
[   25.427418]  [<ffffffff8173db46>] dump_stack+0x46/0x58
[   25.427420]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
[   25.427421]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
[   25.427423]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
[   25.427425]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
[   25.427427]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
[   25.427430]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
[   25.427433]  [<ffffffffa003c208>] tg3_start+0xd18/0x1210 [tg3]
[   25.427434]  [<ffffffff81383eec>] ? __udelay+0x2c/0x30
[   25.427437]  [<ffffffffa0026be4>] ? _tw32_flush+0x44/0x80 [tg3]
[   25.427439]  [<ffffffffa003c828>] tg3_open+0x128/0x2e0 [tg3]
[   25.427441]  [<ffffffff8162c6cf>] __dev_open+0xcf/0x140
[   25.427442]  [<ffffffff8162c9f1>] __dev_change_flags+0xa1/0x160
[   25.427443]  [<ffffffff8162cad9>] dev_change_flags+0x29/0x60
[   25.427445]  [<ffffffff8163a4a9>] do_setlink+0x399/0xa90
[   25.427448]  [<ffffffff8163ca7c>] rtnl_newlink+0x51c/0x740
[   25.427449]  [<ffffffff8163c653>] ? rtnl_newlink+0xf3/0x740
[   25.427452]  [<ffffffff811e730c>] ? new_slab+0x14c/0x490
[   25.427454]  [<ffffffff81303188>] ? security_capable+0x18/0x20
[   25.427455]  [<ffffffff8109cf7d>] ? ns_capable+0x2d/0x60
[   25.427457]  [<ffffffff816391a4>] rtnetlink_rcv_msg+0xa4/0x270
[   25.427459]  [<ffffffff8165840d>] ? __netlink_lookup+0x4d/0x70
[   25.427460]  [<ffffffff81639100>] ? rtnetlink_rcv+0x40/0x40
[   25.427462]  [<ffffffff8165c4a1>] netlink_rcv_skb+0xc1/0xe0
[   25.427464]  [<ffffffff816390ec>] rtnetlink_rcv+0x2c/0x40
[   25.427465]  [<ffffffff8165ba26>] netlink_unicast+0x106/0x210
[   25.427466]  [<ffffffff8165be55>] netlink_sendmsg+0x325/0x790
[   25.427468]  [<ffffffff8160de50>] sock_sendmsg+0xa0/0xe0
[   25.427469]  [<ffffffff8120e8cd>] ? lookup_real+0x1d/0x50
[   25.427471]  [<ffffffff8160e394>] ___sys_sendmsg+0x2f4/0x310
[   25.427472]  [<ffffffff8119bdf2>] ? lru_cache_add_active_or_unevictable+0x32/0xc0
[   25.427475]  [<ffffffff8160c673>] ? sock_destroy_inode+0x33/0x40
[   25.427477]  [<ffffffff8121bfd5>] ? __dentry_kill+0x145/0x1d0
[   25.427478]  [<ffffffff8121c105>] ? dput+0xa5/0x170
[   25.427479]  [<ffffffff81224f74>] ? mntput+0x24/0x40
[   25.427481]  [<ffffffff81206d6a>] ? __fput+0x17a/0x1e0
[   25.427482]  [<ffffffff8160ee21>] __sys_sendmsg+0x51/0x90
[   25.427483]  [<ffffffff8160ee72>] SyS_sendmsg+0x12/0x20
[   25.427493]  [<ffffffff81745129>] system_call_fastpath+0x12/0x17
[   25.427494] ---[ end trace 4cc3153e369484ec ]---

  Marcelo

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-16 17:59         ` Marcelo Ricardo Leitner
@ 2014-12-16 19:54           ` Michael Chan
  2014-12-16 20:02             ` Marcelo Ricardo Leitner
  2014-12-18 19:15             ` Bjorn Helgaas
  0 siblings, 2 replies; 24+ messages in thread
From: Michael Chan @ 2014-12-16 19:54 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Bjorn Helgaas, Rajat Jain, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki, Prashant Sreedharan

On Tue, 2014-12-16 at 15:59 -0200, Marcelo Ricardo Leitner wrote:
> It's a 
> 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722
> Gigabit Ethernet PCI Express
> over here
> 
> I put a WARN_ON(1) after those printks, and this is what I got:
> 
> [    1.550640] pci 0000:02:00.0: 1st 1 1
> [    1.550643] pci 0000:02:00.0: crs_timeout: 0
> [    1.550645] ------------[ cut here ]------------
> [    1.550651] WARNING: CPU: 6 PID: 364 at drivers/pci/probe.c:1445 pci_bus_read_dev_vendor_id+0x1d4/0x1e0()
> [    1.550652] Modules linked in: i915(+) raid0 i2c_algo_bit drm_kms_helper drm e1000e(+) tg3(+) ptp pps_core video
> [    1.550660] CPU: 6 PID: 364 Comm: systemd-udevd Not tainted 3.18.0-rc6+ #8
> [    1.550661] Hardware name: Dell Inc. OptiPlex 9010/03K80F, BIOS A15 08/12/2013
> [    1.550662]  0000000000000000 000000004de2d8dc ffff8807eabdf948 ffffffff8173db46
> [    1.550665]  0000000000000000 0000000000000000 ffff8807eabdf988 ffffffff81094d41
> [    1.550667]  ffff8807eabdf968 ffff8807f1e27000 0000000000000000 0000000000000000
> [    1.550669] Call Trace:
> [    1.550675]  [<ffffffff8173db46>] dump_stack+0x46/0x58
> [    1.550679]  [<ffffffff81094d41>] warn_slowpath_common+0x81/0xa0
> [    1.550681]  [<ffffffff81094e5a>] warn_slowpath_null+0x1a/0x20
> [    1.550683]  [<ffffffff813b2864>] pci_bus_read_dev_vendor_id+0x1d4/0x1e0
> [    1.550687]  [<ffffffff813b7c3e>] pci_device_is_present+0x2e/0x50
> [    1.550693]  [<ffffffffa003364f>] tg3_chip_reset+0x2f/0x940 [tg3]
> [    1.550697]  [<ffffffffa0033f9f>] tg3_halt+0x3f/0x1e0 [tg3]
> [    1.550701]  [<ffffffffa0044f83>] tg3_init_one+0xb83/0x1a40 [tg3] 

So does it work if you use a non-zero crs_timeout?  The driver has
called tg3_halt() which may affect configuration read responses.  I need
to check with the hardware team to see if the 5722 will return CRS in
this scenario.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-16 19:54           ` Michael Chan
@ 2014-12-16 20:02             ` Marcelo Ricardo Leitner
  2014-12-18 19:15             ` Bjorn Helgaas
  1 sibling, 0 replies; 24+ messages in thread
From: Marcelo Ricardo Leitner @ 2014-12-16 20:02 UTC (permalink / raw)
  To: Michael Chan
  Cc: Bjorn Helgaas, Rajat Jain, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki, Prashant Sreedharan

On 16-12-2014 17:54, Michael Chan wrote:
> On Tue, 2014-12-16 at 15:59 -0200, Marcelo Ricardo Leitner wrote:
>> It's a
>> 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722
>> Gigabit Ethernet PCI Express
>> over here
>>
>> I put a WARN_ON(1) after those printks, and this is what I got:
>>
>> [    1.550640] pci 0000:02:00.0: 1st 1 1
>> [    1.550643] pci 0000:02:00.0: crs_timeout: 0
>> [...]
>
> So does it work if you use a non-zero crs_timeout?  The driver has
> called tg3_halt() which may affect configuration read responses.  I need
> to check with the hardware team to see if the 5722 will return CRS in
> this scenario.

Sorry, I replied in the thread that you weren't on yet.
It didn't work; see:
http://thread.gmane.org/gmane.linux.network/342566/focus=37932

   Marcelo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-16 19:54           ` Michael Chan
  2014-12-16 20:02             ` Marcelo Ricardo Leitner
@ 2014-12-18 19:15             ` Bjorn Helgaas
  2014-12-18 19:28               ` Prashant Sreedharan
  1 sibling, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2014-12-18 19:15 UTC (permalink / raw)
  To: Michael Chan
  Cc: Marcelo Ricardo Leitner, Rajat Jain, Nils Holland, David Miller,
	netdev, linux-pci@vger.kernel.org, Rafael Wysocki,
	Prashant Sreedharan

On Tue, Dec 16, 2014 at 12:54 PM, Michael Chan <mchan@broadcom.com> wrote:
> On Tue, 2014-12-16 at 15:59 -0200, Marcelo Ricardo Leitner wrote:
>> It's a
>> 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722
>> Gigabit Ethernet PCI Express
>> over here
>>
>> I put a WARN_ON(1) after those printks, and this is what I got:
>>
>> [    1.550640] pci 0000:02:00.0: 1st 1 1
>> [    1.550643] pci 0000:02:00.0: crs_timeout: 0
>> [...]
>
> So does it work if you use a non-zero crs_timeout?  The driver has
> called tg3_halt() which may affect configuration read responses.  I need
> to check with the hardware team to see if the 5722 will return CRS in
> this scenario.

Any updates from the hardware team?

This is a pretty serious regression, but as far as I can tell, it is
not a PCI bug.  The device should respond to a config read of vendor
ID.  If the driver does something that makes the read return CRS
status, I think the driver is responsible for doing whatever delay or
other fixup is required.

I'm inclined to reassign this bug to the tg3 driver unless you think
the PCI core is doing something wrong here.

Bjorn

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-18 19:15             ` Bjorn Helgaas
@ 2014-12-18 19:28               ` Prashant Sreedharan
  2014-12-18 20:09                 ` Marcelo Ricardo Leitner
  2014-12-18 20:26                 ` Nils Holland
  0 siblings, 2 replies; 24+ messages in thread
From: Prashant Sreedharan @ 2014-12-18 19:28 UTC (permalink / raw)
  To: Bjorn Helgaas, Marcelo Ricardo Leitner
  Cc: Michael Chan, Rajat Jain, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki

On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
> On Tue, Dec 16, 2014 at 12:54 PM, Michael Chan <mchan@broadcom.com> wrote:
> > On Tue, 2014-12-16 at 15:59 -0200, Marcelo Ricardo Leitner wrote:
> >> It's a
> >> 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722
> >> Gigabit Ethernet PCI Express
> >> over here
> >>
> >> I put a WARN_ON(1) after those printks, and this is what I got:
> >>
> >> [    1.550640] pci 0000:02:00.0: 1st 1 1
> >> [    1.550643] pci 0000:02:00.0: crs_timeout: 0
> >> [...]
> >
> > So does it work if you use a non-zero crs_timeout?  The driver has
> > called tg3_halt() which may affect configuration read responses.  I need
> > to check with the hardware team to see if the 5722 will return CRS in
> > this scenario.
> 
> Any updates from the hardware team?
> 
> This is a pretty serious regression, but as far as I can tell, it is
> not a PCI bug.  The device should respond to a config read of vendor
> ID.  If the driver does something that make the read return CRS
> status, I think the driver is responsible for doing whatever delay or
> other fixup is required.
> 
> I'm inclined to reassign this bug to the tg3 driver unless you think
> the PCI core is doing something wrong here.
> 
> Bjorn

We were not able to reproduce this issue.  Could you please check the
value of reg 0x70 before the pci_device_is_present() call is made?
If bit 15 is set, config accesses will be retried.

--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
        void (*write_op)(struct tg3 *, u32, u32);
        int i, err;
 
+       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
        if (!pci_device_is_present(tp->pdev))
                return -ENODEV;
 

 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-18 19:28               ` Prashant Sreedharan
@ 2014-12-18 20:09                 ` Marcelo Ricardo Leitner
  2014-12-18 20:33                   ` Marcelo Ricardo Leitner
  2014-12-18 20:26                 ` Nils Holland
  1 sibling, 1 reply; 24+ messages in thread
From: Marcelo Ricardo Leitner @ 2014-12-18 20:09 UTC (permalink / raw)
  To: Prashant Sreedharan, Bjorn Helgaas
  Cc: Michael Chan, Rajat Jain, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki

On 18-12-2014 17:28, Prashant Sreedharan wrote:
> On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
>> On Tue, Dec 16, 2014 at 12:54 PM, Michael Chan <mchan@broadcom.com> wrote:
>>> On Tue, 2014-12-16 at 15:59 -0200, Marcelo Ricardo Leitner wrote:
>>>> It's a
>>>> 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722
>>>> Gigabit Ethernet PCI Express
>>>> over here
>>>>
>>>> I put a WARN_ON(1) after those printks, and this is what I got:
>>>>
>>>> [    1.550640] pci 0000:02:00.0: 1st 1 1
>>>> [    1.550643] pci 0000:02:00.0: crs_timeout: 0
>>>> [...]
>>>
>>> So does it work if you use a non-zero crs_timeout?  The driver has
>>> called tg3_halt() which may affect configuration read responses.  I need
>>> to check with the hardware team to see if the 5722 will return CRS in
>>> this scenario.
>>
>> Any updates from the hardware team?
>>
>> This is a pretty serious regression, but as far as I can tell, it is
>> not a PCI bug.  The device should respond to a config read of vendor
>> ID.  If the driver does something that make the read return CRS
>> status, I think the driver is responsible for doing whatever delay or
>> other fixup is required.
>>
>> I'm inclined to reassign this bug to the tg3 driver unless you think
>> the PCI core is doing something wrong here.
>>
>> Bjorn
> 
> We were not able to reproduce this issue, could you please check what is
> the value of reg 0x70, before the pci_device_is_present call is made ?
> if bit 15 is set config access will be retried.
> 
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>          void (*write_op)(struct tg3 *, u32, u32);
>          int i, err;
>   
> +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>          if (!pci_device_is_present(tp->pdev))
>                  return -ENODEV;
>   

With that PCI patch applied and my debugs, without the timeout hack (so crs_timeout=0):

[    1.545554] config state: 12b2
[    1.548636] pci 0000:02:00.0: 1st 1 1
[    1.548637] pci 0000:02:00.0: crs_timeout: 0
[    1.548783] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM95722) rev a200] (PCI Express) MAC address 00:0a:f7:2b:9b:39
[    1.548785] tg3 0000:02:00.0 eth0: attached PHY is 5722/5756 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
[    1.548786] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[    1.548787] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[    1.554389] tg3 0000:02:00.0 p1p1: renamed from eth0
...

That's the only time your printk got printed.

  Marcelo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-18 20:09                 ` Marcelo Ricardo Leitner
@ 2014-12-18 20:33                   ` Marcelo Ricardo Leitner
  0 siblings, 0 replies; 24+ messages in thread
From: Marcelo Ricardo Leitner @ 2014-12-18 20:33 UTC (permalink / raw)
  To: Prashant Sreedharan, Bjorn Helgaas
  Cc: Michael Chan, Rajat Jain, Nils Holland, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki

On 18-12-2014 18:09, Marcelo Ricardo Leitner wrote:
> On 18-12-2014 17:28, Prashant Sreedharan wrote:
>> On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
>>> On Tue, Dec 16, 2014 at 12:54 PM, Michael Chan <mchan@broadcom.com> wrote:
>>>> On Tue, 2014-12-16 at 15:59 -0200, Marcelo Ricardo Leitner wrote:
>>>>> It's a
>>>>> 02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722
>>>>> Gigabit Ethernet PCI Express
>>>>> over here
>>>>>
>>>>> I put a WARN_ON(1) after those printks, and this is what I got:
>>>>>
>>>>> [    1.550640] pci 0000:02:00.0: 1st 1 1
>>>>> [    1.550643] pci 0000:02:00.0: crs_timeout: 0
>>>>> [...]
>>>>
>>>> So does it work if you use a non-zero crs_timeout?  The driver has
>>>> called tg3_halt() which may affect configuration read responses.  I need
>>>> to check with the hardware team to see if the 5722 will return CRS in
>>>> this scenario.
>>>
>>> Any updates from the hardware team?
>>>
>>> This is a pretty serious regression, but as far as I can tell, it is
>>> not a PCI bug.  The device should respond to a config read of vendor
>>> ID.  If the driver does something that make the read return CRS
>>> status, I think the driver is responsible for doing whatever delay or
>>> other fixup is required.
>>>
>>> I'm inclined to reassign this bug to the tg3 driver unless you think
>>> the PCI core is doing something wrong here.
>>>
>>> Bjorn
>>
>> We were not able to reproduce this issue, could you please check what is
>> the value of reg 0x70, before the pci_device_is_present call is made ?
>> if bit 15 is set config access will be retried.
>>
>> --- a/drivers/net/ethernet/broadcom/tg3.c
>> +++ b/drivers/net/ethernet/broadcom/tg3.c
>> @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>>           void (*write_op)(struct tg3 *, u32, u32);
>>           int i, err;
>>
>> +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>>           if (!pci_device_is_present(tp->pdev))
>>                   return -ENODEV;
>>
>
> With that PCI patch applied and my debugs, without the timeout hack (so crs_timeout=0):
>
> [    1.545554] config state: 12b2
> [    1.548636] pci 0000:02:00.0: 1st 1 1
> [    1.548637] pci 0000:02:00.0: crs_timeout: 0
> [    1.548783] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM95722) rev a200] (PCI Express) MAC address 00:0a:f7:2b:9b:39
> [    1.548785] tg3 0000:02:00.0 eth0: attached PHY is 5722/5756 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
> [    1.548786] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
> [    1.548787] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
> [    1.554389] tg3 0000:02:00.0 p1p1: renamed from eth0
> ...
>
> That's the only time your printk got printed.

My bad, I forgot I had configured the system not to bring that interface
up anymore. When I do bring it up, I get the same result as Nils:

[ 1743.678714] tg3 0000:02:00.0: irq 32 for MSI/MSI-X
[ 1745.554039] tg3 0000:02:00.0 p1p1: No firmware running
[ 1745.554724] config state: 12b2
[ 1745.557822] pci 0000:02:00.0: 1st 1 1
[ 1745.557827] pci 0000:02:00.0: crs_timeout: 0
[ 1745.559383] config state: 12b2
[ 1745.562470] pci 0000:02:00.0: 1st 1 1
[ 1745.562471] pci 0000:02:00.0: crs_timeout: 0

   Marcelo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-18 19:28               ` Prashant Sreedharan
  2014-12-18 20:09                 ` Marcelo Ricardo Leitner
@ 2014-12-18 20:26                 ` Nils Holland
  2014-12-19  2:10                   ` Prashant Sreedharan
  1 sibling, 1 reply; 24+ messages in thread
From: Nils Holland @ 2014-12-18 20:26 UTC (permalink / raw)
  To: Prashant Sreedharan
  Cc: Bjorn Helgaas, Marcelo Ricardo Leitner, Michael Chan, Rajat Jain,
	David Miller, netdev, linux-pci@vger.kernel.org, Rafael Wysocki

On Thu, Dec 18, 2014 at 11:28:09AM -0800, Prashant Sreedharan wrote:
> On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
> > 
> > Any updates from the hardware team?
> > 
> > This is a pretty serious regression, but as far as I can tell, it is
> > not a PCI bug.  The device should respond to a config read of vendor
> > ID.  If the driver does something that make the read return CRS
> > status, I think the driver is responsible for doing whatever delay or
> > other fixup is required.
> > 
> > I'm inclined to reassign this bug to the tg3 driver unless you think
> > the PCI core is doing something wrong here.
> > 
> > Bjorn
> 
> We were not able to reproduce this issue, could you please check what is
> the value of reg 0x70, before the pci_device_is_present call is made ?
> if bit 15 is set config access will be retried.
> 
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>         void (*write_op)(struct tg3 *, u32, u32);
>         int i, err;
>  
> +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>         if (!pci_device_is_present(tp->pdev))
>                 return -ENODEV;

No problem, I gave this a try and here is what I get:

[    2.185190] libphy: tg3 mdio bus: probed
[    2.229357] tsc: Refined TSC clocksource calibration: 2399.999 MHz
[    2.244993] config state: 1292
[    2.247136] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev 57780001] (PCI Express) MAC address 00:19:99:ce:13:a6
[    2.249279] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom BCM57780] (mii_bus:phy_addr=200:01)
[    2.251460] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[    2.253672] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[...]
[   12.204692] tg3 0000:02:00.0 enp2s0: No firmware running
[   12.206653] config state: 1292
[   12.208655] config state: 1292

Those are all three times the new debugging line gets hit when I
boot my system with the supplied diagnostic patch.

Hope that helps. Of course, I'd gladly test any further
(diagnostic) patches if required! Also, if I can provide any
additional information that might be of value, just ask :-)

Greetings,
Nils

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-18 20:26                 ` Nils Holland
@ 2014-12-19  2:10                   ` Prashant Sreedharan
  2014-12-19 17:09                     ` Bjorn Helgaas
  0 siblings, 1 reply; 24+ messages in thread
From: Prashant Sreedharan @ 2014-12-19  2:10 UTC (permalink / raw)
  To: Nils Holland, Marcelo Ricardo Leitner
  Cc: Bjorn Helgaas, Michael Chan, Rajat Jain, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki

On Thu, 2014-12-18 at 21:26 +0100, Nils Holland wrote:
> On Thu, Dec 18, 2014 at 11:28:09AM -0800, Prashant Sreedharan wrote:
> > On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
> > > 
> > > Any updates from the hardware team?
> > > 
> > > This is a pretty serious regression, but as far as I can tell, it is
> > > not a PCI bug.  The device should respond to a config read of vendor
> > > ID.  If the driver does something that make the read return CRS
> > > status, I think the driver is responsible for doing whatever delay or
> > > other fixup is required.
> > > 
> > > I'm inclined to reassign this bug to the tg3 driver unless you think
> > > the PCI core is doing something wrong here.
> > > 
> > > Bjorn
> > 
> > We were not able to reproduce this issue, could you please check what is
> > the value of reg 0x70, before the pci_device_is_present call is made ?
> > if bit 15 is set config access will be retried.
> > 
> > --- a/drivers/net/ethernet/broadcom/tg3.c
> > +++ b/drivers/net/ethernet/broadcom/tg3.c
> > @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
> >         void (*write_op)(struct tg3 *, u32, u32);
> >         int i, err;
> >  
> > +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
> >         if (!pci_device_is_present(tp->pdev))
> >                 return -ENODEV;
> 
> No problem, I gave this a try and here is what I get:
> 
> [    2.185190] libphy: tg3 mdio bus: probed
> [    2.229357] tsc: Refined TSC clocksource calibration: 2399.999 MHz
> [    2.244993] config state: 1292
> [    2.247136] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev 57780001]
>         (PCI Express) MAC address 00:19:99:ce:13:a6
> [    2.249279] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom BCM57780]
>         (mii_bus:phy_addr=200:01)
> [    2.251460] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0]
>         MIirq[0] ASF[0] TSOcap[1]
> [    2.253672] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
> [...]
> [   12.204692] tg3 0000:02:00.0
>         enp2s0: No firmware running
> [   12.206653] config state: 1292
> [   12.208655] config state: 1292
> 
> That's all of the three times the new debugging line gets hit when I
> boot my system using the supplied diagnostic patch.
> 
> Hope that helps - of course, I'd gladly test any further
> (diagnostic) patches if required! Also, if I can provide any
> additional information that might be of value, just ask:-)
> 
Nils/Marcelo, thanks for the input. Since reg 0x70 bit 15 is clear, the
chip is not setting the config retry bit. We were hoping this bit was
causing the config access to return CRS, but it looks like it is not.
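
To make the bit test concrete, here is a minimal sketch applied to the
two values reported in this thread (0x12b2 from the 5722 and 0x1292 from
the 57780). It is not driver code: the macro CFG_RETRY_BIT and the helper
config_retry_set() are names made up for illustration and are not taken
from tg3.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for bit 15 of register 0x70 (not from tg3.h). */
#define CFG_RETRY_BIT (1u << 15)	/* 0x8000 */

/* Returns true if bit 15 is set in a value read from register 0x70. */
static bool config_retry_set(uint32_t pcistate)
{
	return (pcistate & CFG_RETRY_BIT) != 0;
}
```

Both 0x12b2 and 0x1292 are below 0x8000, so the helper reports the bit
as clear for each of them.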

Btw, after forcing the error path (tg3_init_one -> tg3_halt) in the
driver, we are now able to reproduce the problem on a 5722 in house. We
are working with the HW team to narrow this down.

Also it is not clear to me how reverting commit cfa6a7877b17a667 fixes
the problem.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-19  2:10                   ` Prashant Sreedharan
@ 2014-12-19 17:09                     ` Bjorn Helgaas
  2014-12-19 17:16                       ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2014-12-19 17:09 UTC (permalink / raw)
  To: Prashant Sreedharan
  Cc: Nils Holland, Marcelo Ricardo Leitner, Michael Chan, Rajat Jain,
	David Miller, netdev, linux-pci@vger.kernel.org, Rafael Wysocki

On Thu, Dec 18, 2014 at 7:10 PM, Prashant Sreedharan
<prashant@broadcom.com> wrote:
> On Thu, 2014-12-18 at 21:26 +0100, Nils Holland wrote:
>> On Thu, Dec 18, 2014 at 11:28:09AM -0800, Prashant Sreedharan wrote:
>> > On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
>> > >
>> > > Any updates from the hardware team?
>> > >
>> > > This is a pretty serious regression, but as far as I can tell, it is
>> > > not a PCI bug.  The device should respond to a config read of vendor
>> > > ID.  If the driver does something that make the read return CRS
>> > > status, I think the driver is responsible for doing whatever delay or
>> > > other fixup is required.
>> > >
>> > > I'm inclined to reassign this bug to the tg3 driver unless you think
>> > > the PCI core is doing something wrong here.
>> > >
>> > > Bjorn
>> >
>> > We were not able to reproduce this issue, could you please check what is
>> > the value of reg 0x70, before the pci_device_is_present call is made ?
>> > if bit 15 is set config access will be retried.
>> >
>> > --- a/drivers/net/ethernet/broadcom/tg3.c
>> > +++ b/drivers/net/ethernet/broadcom/tg3.c
>> > @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>> >         void (*write_op)(struct tg3 *, u32, u32);
>> >         int i, err;
>> >
>> > +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>> >         if (!pci_device_is_present(tp->pdev))
>> >                 return -ENODEV;
>>
>> No problem, I gave this a try and here is what I get:
>>
>> [    2.185190] libphy: tg3 mdio bus: probed
>> [    2.229357] tsc: Refined TSC clocksource calibration: 2399.999 MHz
>> [    2.244993] config state: 1292
>> [    2.247136] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev 57780001]
>>         (PCI Express) MAC address 00:19:99:ce:13:a6
>> [    2.249279] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom BCM57780]
>>         (mii_bus:phy_addr=200:01)
>> [    2.251460] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0]
>>         MIirq[0] ASF[0] TSOcap[1]
>> [    2.253672] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
>> [...]
>> [   12.204692] tg3 0000:02:00.0
>>         enp2s0: No firmware running
>> [   12.206653] config state: 1292
>> [   12.208655] config state: 1292
>>
>> That's all of the three times the new debugging line gets hit when I
>> boot my system using the supplied diagnostic patch.
>>
>> Hope that helps - of course, I'd gladly test any further
>> (diagnostic) patches if required! Also, if I can provide any
>> additional information that might be of value, just ask:-)
>>
> Nils/Marcelo thanks for inputs, since reg 0x70 bit 15 is clear it
> indicates the chip is not setting the config retry bit. We were hoping
> this bit is causing the config access to return CRS but looks like it is
> not.
>
> Btw after forcing the error path (tg3_init_one -> tg3_halt) in the
> driver now we are able to reproduce the problem on 5722 in house. We are
> working with the HW team to narrow this down.
>
> Also it is not clear to me how reverting commit cfa6a7877b17a667 fixes
> the problem.

The full commit is 89665a6a71408796565bfd29cfa6a7877b17a667, and git
works with any unique *prefix* of that.  The current convention is to
use the first 12 characters (I have "[core] abbrev = 12" in my
.git/config).  Unfortunately, suffixes don't work at all.

Anyway, here's why I think 89665a6a7140 makes a difference.  We're in this path:

  pci_device_is_present
    pci_bus_read_dev_vendor_id(..., crs_timeout = 0)
      pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l)

and for some reason the chip returns 0x00010001 for that 32-bit read.
Before 89665a6a7140, we compared all 32 bits with "*l == 0xffff0001".
This is false, so pci_bus_read_dev_vendor_id() returns true, which
means pci_device_is_present() is also true.

After 89665a6a7140, we compare only the low 16 bits with ((*l &
0xffff) == 0x0001), which is true, so pci_bus_read_dev_vendor_id()
returns false, and pci_device_is_present() is false.
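
The two comparisons can be put side by side in a small standalone
sketch. This is a simplification for illustration, not the code in
drivers/pci/probe.c: the helper names crs_check_old(), crs_check_new()
and device_present() are made up, and the sketch models only the
crs_timeout == 0 case, where a value classified as CRS is reported
straight away as "not present".

```c
#include <stdbool.h>
#include <stdint.h>

/* Value the thread reports for the 32-bit read at PCI_VENDOR_ID. */
#define OBSERVED_ID 0x00010001u

/* Check before 89665a6a7140: CRS only if all 32 bits are 0xffff0001. */
static bool crs_check_old(uint32_t l)
{
	return l == 0xffff0001u;
}

/* Check after 89665a6a7140: CRS if the low 16 bits (Vendor ID) are 0x0001. */
static bool crs_check_new(uint32_t l)
{
	return (l & 0xffffu) == 0x0001u;
}

/*
 * Simplified model of the crs_timeout == 0 path: if the value is
 * classified as CRS, no retry happens and the device counts as absent.
 */
static bool device_present(bool (*is_crs)(uint32_t), uint32_t l)
{
	return !is_crs(l);
}
```

With OBSERVED_ID, the old check does not match, so the device counts as
present; the new check matches on the low 16 bits, so the device counts
as absent and tg3_chip_reset() bails out with -ENODEV.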

Bjorn

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-19 17:09                     ` Bjorn Helgaas
@ 2014-12-19 17:16                       ` Marcelo Ricardo Leitner
  2014-12-19 18:24                         ` Rajat Jain
  0 siblings, 1 reply; 24+ messages in thread
From: Marcelo Ricardo Leitner @ 2014-12-19 17:16 UTC (permalink / raw)
  To: Bjorn Helgaas, Prashant Sreedharan
  Cc: Nils Holland, Michael Chan, Rajat Jain, David Miller, netdev,
	linux-pci@vger.kernel.org, Rafael Wysocki

On 19-12-2014 15:09, Bjorn Helgaas wrote:
> On Thu, Dec 18, 2014 at 7:10 PM, Prashant Sreedharan
> <prashant@broadcom.com> wrote:
>> On Thu, 2014-12-18 at 21:26 +0100, Nils Holland wrote:
>>> On Thu, Dec 18, 2014 at 11:28:09AM -0800, Prashant Sreedharan wrote:
>>>> On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
>>>>>
>>>>> Any updates from the hardware team?
>>>>>
>>>>> This is a pretty serious regression, but as far as I can tell, it is
>>>>> not a PCI bug.  The device should respond to a config read of vendor
>>>>> ID.  If the driver does something that makes the read return CRS
>>>>> status, I think the driver is responsible for doing whatever delay or
>>>>> other fixup is required.
>>>>>
>>>>> I'm inclined to reassign this bug to the tg3 driver unless you think
>>>>> the PCI core is doing something wrong here.
>>>>>
>>>>> Bjorn
>>>>
>>>> We were not able to reproduce this issue, could you please check what is
>>>> the value of reg 0x70, before the pci_device_is_present call is made ?
>>>> if bit 15 is set config access will be retried.
>>>>
>>>> --- a/drivers/net/ethernet/broadcom/tg3.c
>>>> +++ b/drivers/net/ethernet/broadcom/tg3.c
>>>> @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>>>>          void (*write_op)(struct tg3 *, u32, u32);
>>>>          int i, err;
>>>>
>>>> +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>>>>          if (!pci_device_is_present(tp->pdev))
>>>>                  return -ENODEV;
>>>
>>> No problem, I gave this a try and here is what I get:
>>>
>>> [    2.185190] libphy: tg3 mdio bus: probed
>>> [    2.229357] tsc: Refined TSC clocksource calibration: 2399.999 MHz
>>> [    2.244993] config state: 1292
>>> [    2.247136] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev 57780001]
>>>          (PCI Express) MAC address 00:19:99:ce:13:a6
>>> [    2.249279] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom BCM57780]
>>>          (mii_bus:phy_addr=200:01)
>>> [    2.251460] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0]
>>>          MIirq[0] ASF[0] TSOcap[1]
>>> [    2.253672] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000] dma_mask[64-bit]
>>> [...]
>>> [   12.204692] tg3 0000:02:00.0
>>>          enp2s0: No firmware running
>>> [   12.206653] config state: 1292
>>> [   12.208655] config state: 1292
>>>
>>> That's all of the three times the new debugging line gets hit when I
>>> boot my system using the supplied diagnostic patch.
>>>
>>> Hope that helps - of course, I'd gladly test any further
>>> (diagnostic) patches if required! Also, if I can provide any
>>> additional information that might be of value, just ask:-)
>>>
>> Nils/Marcelo thanks for inputs, since reg 0x70 bit 15 is clear it
>> indicates the chip is not setting the config retry bit. We were hoping
>> this bit is causing the config access to return CRS but looks like it is
>> not.
>>
>> Btw after forcing the error path (tg3_init_one -> tg3_halt) in the
>> driver now we are able to reproduce the problem on 5722 in house. We are
>> working with the HW team to narrow this down.
>>
>> Also it is not clear to me how reverting commit cfa6a7877b17a667 fixes
>> the problem.
>
> The full commit is 89665a6a71408796565bfd29cfa6a7877b17a667, and git
> works with any unique *prefix* of that.  The current convention is to
> use the first 12 characters (I have "[core] abbrev = 12" in my
> .git/config).  Unfortunately, suffixes don't work at all.
>
> Anyway, here's why I think 89665a6a7140 makes a difference.  We're in this path:
>
>    pci_device_is_present
>      pci_bus_read_dev_vendor_id(..., crs_timeout = 0)
>        pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l)
>
> and for some reason the chip returns 0x00010001 for that 32-bit read.

Actually it returns just 0x00000001, but yeah, that's my understanding too.

   Marcelo

> Before 89665a6a7140, we compared all 32 bits with "*l == 0xffff0001".
> This is false, so pci_bus_read_dev_vendor_id() returns true, which
> means pci_device_is_present() is also true.
>
> After 89665a6a7140, we compare only the low 16 bits with ((*l &
> 0xffff) == 0x0001), which is true, so pci_bus_read_dev_vendor_id()
> returns false, and pci_device_is_present() is false.
>
> Bjorn
>


* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-19 17:16                       ` Marcelo Ricardo Leitner
@ 2014-12-19 18:24                         ` Rajat Jain
  2014-12-19 18:53                           ` Prashant Sreedharan
  0 siblings, 1 reply; 24+ messages in thread
From: Rajat Jain @ 2014-12-19 18:24 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Bjorn Helgaas, Prashant Sreedharan, Nils Holland, Michael Chan,
	David Miller, netdev, linux-pci@vger.kernel.org, Rafael Wysocki

One of the reasons to replace the condition (*l == 0xffff0001) with
((*l & 0xffff) == 0x0001) was that some devices apparently returned
0x0001 for the vendor ID to indicate CRS, but returned the actual device
ID in the device ID field (hence the need to ignore the device ID field).

Are we saying that the tg3 device returns 0x0001 for the vendor ID and
yet expects it to be treated as a good value (not CRS)?

Thanks,

Rajat

On Fri, Dec 19, 2014 at 9:16 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On 19-12-2014 15:09, Bjorn Helgaas wrote:
>>
>> On Thu, Dec 18, 2014 at 7:10 PM, Prashant Sreedharan
>> <prashant@broadcom.com> wrote:
>>>
>>> On Thu, 2014-12-18 at 21:26 +0100, Nils Holland wrote:
>>>>
>>>> On Thu, Dec 18, 2014 at 11:28:09AM -0800, Prashant Sreedharan wrote:
>>>>>
>>>>> On Thu, 2014-12-18 at 12:15 -0700, Bjorn Helgaas wrote:
>>>>>>
>>>>>>
>>>>>> Any updates from the hardware team?
>>>>>>
>>>>>> This is a pretty serious regression, but as far as I can tell, it is
>>>>>> not a PCI bug.  The device should respond to a config read of vendor
>>>>>> ID.  If the driver does something that makes the read return CRS
>>>>>> status, I think the driver is responsible for doing whatever delay or
>>>>>> other fixup is required.
>>>>>>
>>>>>> I'm inclined to reassign this bug to the tg3 driver unless you think
>>>>>> the PCI core is doing something wrong here.
>>>>>>
>>>>>> Bjorn
>>>>>
>>>>>
>>>>> We were not able to reproduce this issue, could you please check what
>>>>> is
>>>>> the value of reg 0x70, before the pci_device_is_present call is made ?
>>>>> if bit 15 is set config access will be retried.
>>>>>
>>>>> --- a/drivers/net/ethernet/broadcom/tg3.c
>>>>> +++ b/drivers/net/ethernet/broadcom/tg3.c
>>>>> @@ -9025,6 +9025,7 @@ static int tg3_chip_reset(struct tg3 *tp)
>>>>>          void (*write_op)(struct tg3 *, u32, u32);
>>>>>          int i, err;
>>>>>
>>>>> +       printk(KERN_ERR "config state: %x\n", tr32(TG3PCI_PCISTATE));
>>>>>          if (!pci_device_is_present(tp->pdev))
>>>>>                  return -ENODEV;
>>>>
>>>>
>>>> No problem, I gave this a try and here is what I get:
>>>>
>>>> [    2.185190] libphy: tg3 mdio bus: probed
>>>> [    2.229357] tsc: Refined TSC clocksource calibration: 2399.999 MHz
>>>> [    2.244993] config state: 1292
>>>> [    2.247136] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM57780) rev
>>>> 57780001]
>>>>          (PCI Express) MAC address 00:19:99:ce:13:a6
>>>> [    2.249279] tg3 0000:02:00.0 eth0: attached PHY driver [Broadcom
>>>> BCM57780]
>>>>          (mii_bus:phy_addr=200:01)
>>>> [    2.251460] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0]
>>>>          MIirq[0] ASF[0] TSOcap[1]
>>>> [    2.253672] tg3 0000:02:00.0 eth0: dma_rwctrl[76180000]
>>>> dma_mask[64-bit]
>>>> [...]
>>>> [   12.204692] tg3 0000:02:00.0
>>>>          enp2s0: No firmware running
>>>> [   12.206653] config state: 1292
>>>> [   12.208655] config state: 1292
>>>>
>>>> That's all of the three times the new debugging line gets hit when I
>>>> boot my system using the supplied diagnostic patch.
>>>>
>>>> Hope that helps - of course, I'd gladly test any further
>>>> (diagnostic) patches if required! Also, if I can provide any
>>>> additional information that might be of value, just ask:-)
>>>>
>>> Nils/Marcelo thanks for inputs, since reg 0x70 bit 15 is clear it
>>> indicates the chip is not setting the config retry bit. We were hoping
>>> this bit is causing the config access to return CRS but looks like it is
>>> not.
>>>
>>> Btw after forcing the error path (tg3_init_one -> tg3_halt) in the
>>> driver now we are able to reproduce the problem on 5722 in house. We are
>>> working with the HW team to narrow this down.
>>>
>>> Also it is not clear to me how reverting commit cfa6a7877b17a667 fixes
>>> the problem.
>>
>>
>> The full commit is 89665a6a71408796565bfd29cfa6a7877b17a667, and git
>> works with any unique *prefix* of that.  The current convention is to
>> use the first 12 characters (I have "[core] abbrev = 12" in my
>> .git/config).  Unfortunately, suffixes don't work at all.
>>
>> Anyway, here's why I think 89665a6a7140 makes a difference.  We're in this
>> path:
>>
>>    pci_device_is_present
>>      pci_bus_read_dev_vendor_id(..., crs_timeout = 0)
>>        pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l)
>>
>> and for some reason the chip returns 0x00010001 for that 32-bit read.
>
>
> Actually it returns just 0x00000001, but yeah, that's my understanding too.
>
>   Marcelo
>
>
>> Before 89665a6a7140, we compared all 32 bits with "*l == 0xffff0001".
>> This is false, so pci_bus_read_dev_vendor_id() returns true, which
>> means pci_device_is_present() is also true.
>>
>> After 89665a6a7140, we compare only the low 16 bits with ((*l &
>> 0xffff) == 0x0001), which is true, so pci_bus_read_dev_vendor_id()
>> returns false, and pci_device_is_present() is false.
>>
>> Bjorn
>>
>


* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-19 18:24                         ` Rajat Jain
@ 2014-12-19 18:53                           ` Prashant Sreedharan
  2014-12-19 19:37                             ` Rajat Jain
  0 siblings, 1 reply; 24+ messages in thread
From: Prashant Sreedharan @ 2014-12-19 18:53 UTC (permalink / raw)
  To: rajatxjain
  Cc: Marcelo Ricardo Leitner, Bjorn Helgaas, Nils Holland,
	Michael Chan, David Miller, netdev, linux-pci@vger.kernel.org,
	Rafael Wysocki

On Fri, 2014-12-19 at 10:24 -0800, Rajat Jain wrote:
> One of the reasons to replace the condition (*l == 0xffff0001) with
> ((*l & 0xffff) == 0x0001) was that some devices apparently returned
> 0x0001 for the vendor ID to indicate CRS, but returned the actual device
> ID in the device ID field (hence the need to ignore the device ID field).
>
> Are we saying that the tg3 device returns 0x0001 for the vendor ID and
> yet expects it to be treated as a good value (not CRS)?
> 
No, it should not be treated as a good value. This commit has exposed
a problem in the 5722 chipset that was previously masked.


* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-19 18:53                           ` Prashant Sreedharan
@ 2014-12-19 19:37                             ` Rajat Jain
  0 siblings, 0 replies; 24+ messages in thread
From: Rajat Jain @ 2014-12-19 19:37 UTC (permalink / raw)
  To: Prashant Sreedharan
  Cc: Marcelo Ricardo Leitner, Bjorn Helgaas, Nils Holland,
	Michael Chan, David Miller, netdev, linux-pci@vger.kernel.org,
	Rafael Wysocki

On Fri, Dec 19, 2014 at 10:53 AM, Prashant Sreedharan
<prashant@broadcom.com> wrote:
> On Fri, 2014-12-19 at 10:24 -0800, Rajat Jain wrote:
>> One of the reasons to replace the condition (*l == 0xffff0001) with
>> ((*l & 0xffff) == 0x0001) was that some devices apparently returned
>> 0x0001 for the vendor ID to indicate CRS, but returned the actual device
>> ID in the device ID field (hence the need to ignore the device ID field).
>>
>> Are we saying that the tg3 device returns 0x0001 for the vendor ID and
>> yet expects it to be treated as a good value (not CRS)?
>>
> No, it should not be treated as a good value. This commit has exposed
> a problem in the 5722 chipset that was previously masked.
>

Got it. Thanks.

I assume you mean it's a HW issue, and that the workaround, if required,
will be taken care of in the tg3 driver.

Thanks,

Rajat

>
>


* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-16 16:04   ` Rajat Jain
  2014-12-16 16:20     ` Bjorn Helgaas
@ 2014-12-16 18:00     ` Marcelo Ricardo Leitner
  2014-12-16 20:38       ` Nils Holland
  1 sibling, 1 reply; 24+ messages in thread
From: Marcelo Ricardo Leitner @ 2014-12-16 18:00 UTC (permalink / raw)
  To: rajatxjain; +Cc: Nils Holland, David Miller, netdev, linux-pci@vger.kernel.org

On 16-12-2014 14:04, Rajat Jain wrote:
> Hello All,
>
> Apologies for jumping in late, but for some reason I do not see the
> original mail in my inbox. However I am taking a look at the mails as
> sent on linux-pci (and I will keep an eye out for the bug report that
> Bjorn asked for).
>

np!
Nils, would you create that BZ please? As you did all the bisecting.. :)

>
>>
>> I'm getting, with commit 89665a6a71408796565bfd29cfa6a7877b17a667:
>>
>> $ grep 'pci 0000:02' tg3.bad
>> [    0.190733] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190736] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190810] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
>> [    0.190885] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
>> [    0.191048] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
>> [    0.191382] pci 0000:02:00.0: PME# supported from D3hot D3cold
>> [    0.191438] pci 0000:02:00.0: System wakeup disabled by ACPI
>> [    1.561555] pci 0000:02:00.0: 1st 1 1
>> [    1.561558] pci 0000:02:00.0: crs_timeout: 0
>> [   20.412021] pci 0000:02:00.0: 1st 1 1
>> [   20.412022] pci 0000:02:00.0: crs_timeout: 0
>> [   20.413596] pci 0000:02:00.0: 1st 1 1
>> [   20.413598] pci 0000:02:00.0: crs_timeout: 0
>>
>> And without it:
>>
>> $ grep 'pci 0000:02' tg3.good
>> [    0.190734] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190738] pci 0000:02:00.0: 1st 165a14e4 14e4
>> [    0.190811] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
>> [    0.190884] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
>> [    0.191047] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
>> [    0.191380] pci 0000:02:00.0: PME# supported from D3hot D3cold
>> [    0.191439] pci 0000:02:00.0: System wakeup disabled by ACPI
>> [    1.576778] pci 0000:02:00.0: 1st 1 1
>> [   19.068517] pci 0000:02:00.0: 1st 165a14e4 14e4
>>
>
> It seems that in the first 2 attempts that were made to probe the
> device are all OK and return regular device ID and vendor ID for TG3
> (CRS does not have a role to play). However, later attempts return a
> CRS.
>
> 1) May I ask if you are using acpihp or pciehp? I assume pciehp?

Well.. the system doesn't support hotplug..
The chipset is an "Intel Corporation 5 Series/3400 Series", fwiw

> 2) Can you please also send dmesg output while passing
> pciehp.pciehp_debug=1? In the fail case, do you see a message
> indicating the pciehp gave up since it got CRS for a long time
> (something like "pci 0000:02:00.0 id reading try 50 times with
> interval 20 ms to get ffff0001")?

I did use that option anyway, but it resulted in no new messages.

> 3) Currently the pciehp passes "0" for the argument "crs_timeout" to
> pci_bus_read_dev_vendor_id(). Can you please try increasing it to, say
> 30 seconds (30 * 1000). (For comparison data, acpihp uses the value
> 60*1000 i.e. 60 seconds today) and run the fail case once again?
>
> Thanks a lot in advance for the debugging help ;-)
>

It seems it's not safe to do that, given those backtraces.
I tried it anyway: the system was very slow to boot, the NIC still didn't
come up, and I got a bunch of "scheduling while atomic" BUGs due to that
msleep() call.

The first invocation was fine:
Dec 16 15:40:00 odin kernel: [    0.190711] pci 0000:02:00.0: 1st 165a14e4 14e4
Dec 16 15:40:00 odin kernel: [    0.190717] pci 0000:02:00.0: 1st 165a14e4 14e4
Dec 16 15:40:00 odin kernel: [    0.191091] pci 0000:02:00.0: System wakeup disabled by ACPI
Dec 16 15:40:00 odin kernel: [    1.576061] pci 0000:02:00.0: 1st 1 1
Dec 16 15:40:00 odin kernel: [    1.577474] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.580487] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.585508] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.594499] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.611499] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.644521] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.709566] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    1.838654] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    2.095765] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    2.608956] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    3.634443] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    5.684388] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [    9.783279] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [   17.980060] pci 0000:02:00.0: 1 1
Dec 16 15:40:00 odin kernel: [   34.372640] pci 0000:02:00.0: not responding
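
The gaps between the "1 1" lines above roughly double each time (1 ms,
2 ms, 4 ms, ...), which is what you would see from a retry loop that sleeps
for a delay starting at 1 ms and doubling until it exceeds crs_timeout.
Below is a small user-space simulation of such a loop, under that
assumption. It is a sketch for checking the arithmetic, not the kernel's
code; the struct and function names are hypothetical, and the device is
assumed to answer every read with CRS.

```c
#include <stdbool.h>

struct crs_poll_result {
	int sleeps;     /* number of msleep() calls that would be made */
	long total_ms;  /* sum of all delays, in milliseconds */
	bool gave_up;   /* true if the "not responding" path was reached */
};

/*
 * Simulate polling a device that always returns CRS (assumed worst case).
 * crs_timeout_ms stands in for the crs_timeout argument discussed above.
 */
static inline struct crs_poll_result simulate_crs_poll(int crs_timeout_ms)
{
	struct crs_poll_result r = { 0, 0, false };
	int delay = 1;

	if (!crs_timeout_ms) {          /* 0 means "do not retry at all" */
		r.gave_up = true;
		return r;
	}

	for (;;) {
		r.sleeps++;
		r.total_ms += delay;    /* stands in for msleep(delay) */
		delay *= 2;
		if (delay > crs_timeout_ms) {
			r.gave_up = true;
			return r;
		}
	}
}
```

With crs_timeout_ms = 30 * 1000 this gives 15 sleeps totalling 32767 ms,
which lines up with the log: the first CRS read at about 1.576 s and "not
responding" at about 34.37 s.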

The other two...
Dec 16 15:40:09 odin kernel: [   54.154688] pci 0000:02:00.0: 1st 1 1
Dec 16 15:40:09 odin kernel: [   54.154690] BUG: scheduling while atomic: ip/1575/0x00000200
Dec 16 15:40:09 odin kernel: pci 0000:02:00.0: 1st 1 1
Dec 16 15:40:09 odin kernel: BUG: scheduling while atomic: ip/1575/0x00000200
Dec 16 15:40:09 odin kernel: pci 0000:02:00.0: 1 1
Dec 16 15:40:09 odin kernel: BUG: scheduling while atomic: ip/1575/0x00000200
(...)

The BUG backtraces were very similar to the 2nd and 3rd ones I posted in
the other email; they just pointed to the msleep() call instead of my
BUG_ON(1).

I can dig deeper if you think it's worth it, but as the 1st call didn't
have this issue and didn't complete either, it seems the test is
conclusive.. right?

Thanks,
Marcelo


* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-16 18:00     ` Marcelo Ricardo Leitner
@ 2014-12-16 20:38       ` Nils Holland
  0 siblings, 0 replies; 24+ messages in thread
From: Nils Holland @ 2014-12-16 20:38 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: rajatxjain, David Miller, netdev, linux-pci@vger.kernel.org

On Tue, Dec 16, 2014 at 04:00:04PM -0200, Marcelo Ricardo Leitner wrote:
> On 16-12-2014 14:04, Rajat Jain wrote:
> > Hello All,
> >
> > Apologies for jumping in late, but for some reason I do not see the
> > original mail in my inbox. However I am taking a look at the mails as
> > sent on linux-pci (and I will keep an eye out for the bug report that
> > Bjorn asked for).
> >
> 
> np!
> Nils, would you create that BZ please? As you did all the bisecting.. :)

I'm only just catching up with all the new messages in this thread,
but I've already opened a bug report, as requested. Come and find it
at:

https://bugzilla.kernel.org/show_bug.cgi?id=89821

Feel free to add more relevant details to it! :-)

Greetings,
Nils


* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-13 21:02 [bisected] tg3 broken in 3.18.0? Nils Holland
  2014-12-15 15:06 ` Marcelo Ricardo Leitner
@ 2014-12-16  0:31 ` Bjorn Helgaas
  1 sibling, 0 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2014-12-16  0:31 UTC (permalink / raw)
  To: Nils Holland; +Cc: David Miller, netdev, linux-pci@vger.kernel.org, Rajat Jain

On Sat, Dec 13, 2014 at 2:02 PM, Nils Holland <nholland@tisys.org> wrote:
> On Fri, Dec 12, 2014 at 08:18:31PM -0500, David Miller wrote:
>> From: Nils Holland <nholland@tisys.org>
>> Date: Sat, 13 Dec 2014 02:14:08 +0100
>>
>> >
>> > My bisect exercise suggests that the following commit is the culprit:
>> >
>> > 89665a6a71408796565bfd29cfa6a7877b17a667 (PCI: Check only the Vendor
>> > ID to identify Configuration Request Retry)
>>
>> You definitely need to bring this up with the author of that change
>> and the relevant list for the PCI subsystem and/or linux-kernel.
>
> I've now already sent an inquiry to Rajat Jain, the author of the
> patch in question, and this message here is now also CC'd to
> linux-pci@.
>
> With this message, I'd like to add one last result of investigation
> I've done today, in the hope that it will aid the folks with more
> knowledge to go after the issue.
>
> Basically, I've added a little debug output to tg3.c in the function
> tg3_poll_fw(), as that function contained the code that would print
> out the "No firmware running" line that was visible in dmesg on those
> kernels where tg3 would not work for me. So, I basically had this:
>
> static int tg3_poll_fw(struct tg3 *tp)
> {
>         int i;
>         u32 val;
>
>         netdev_info(tp->dev, "XX: Boom!\n");
>         [...]
> }
>
> Now, I was looking through dmesg searching for occurrences of this
> debug output, using a standard 3.18.0 kernel (where my tg3 doesn't
> work) as well as using a 3.18.0 kernel with
> 89665a6a71408796565bfd29cfa6a7877b17a667 reverted (where my tg3
> works). Here's the results:
>
> [standard 3.18.0 (=problematic)]:
> [    2.197653] libphy: tg3 mdio bus: probed
> [    2.257488] tg3 0000:02:00.0 eth0:
>         Tigon3 [partno(BCM57780) rev 57780001] (PCI Express) MAC address
>         00:19:99:ce:13:a6
> [    2.259589] tg3 0000:02:00.0 eth0:
>         attached PHY driver [Broadcom BCM57780] (mii_bus:phy_addr=200:01)
> [    2.261740] tg3 0000:02:00.0 eth0:
>         RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
> [    2.263912] tg3 0000:02:00.0 eth0:
>         dma_rwctrl[76180000] dma_mask[64-bit]
> [...]
> [   10.028002] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
> [   10.028247] tg3 0000:02:00.0 enp2s0: XX: Boom!
> [   12.157034] tg3 0000:02:00.0 enp2s0: No firmware running
>
>
> [3.18.0 without above mentioned patch, 3.17.3 is the same, both result
> in a working tg3]:
> [    1.397167] libphy: tg3 mdio bus: probed
> [    1.456473] tg3 0000:02:00.0
>         (unnamed net_device) (uninitialized): XX: Boom!
> [    1.464987] tg3 0000:02:00.0 eth0:
>         Tigon3 [partno(BCM57780) rev 57780001] (PCI Express) MAC address
>         00:19:99:ce:13:a6
> [    1.467118] tg3 0000:02:00.0 eth0:
>         attached PHY driver [Broadcom BCM57780] (mii_bus:phy_addr=200:01)
> [    1.469311] tg3 0000:02:00.0 eth0:
>         RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
> [    1.471500] tg3 0000:02:00.0 eth0:
>         dma_rwctrl[76180000] dma_mask[64-bit]
> [...]
> [    9.631629] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
> [    9.631962] tg3 0000:02:00.0 enp2s0: XX: Boom!
> [    9.634339] tg3 0000:02:00.0 enp2s0: XX: Boom!
> [    9.642741] IPv6:
>         ADDRCONF(NETDEV_UP): enp2s0: link is not ready
> [   10.479636] tg3 0000:02:00.0
>         enp2s0: Link is down
> [   11.484498] tg3 0000:02:00.0
>         enp2s0: Link is up at 100 Mbps, full duplex
>
> As can be seen, there are two tg3-related sections in my dmesg in both
> the working and non-working scenarios: At about 1 - 2 secs, the card
> seems to begin initializing, and at about 9 - 10 seconds it is (or
> should be) ready to establish a network connection.
>
> My debug section, or tg3.c's tg3_poll_fw(), seems to be called thrice
> in the working situation: The first hit occurs at 1.456473 where the tg3
> device is still reported as "(unnamed net_device) (uninitialized)".
> Then, the section gets hit twice again at around 9.63 - at this point
> the driver already reports the card as initialized / by its real name.
>
> In the non-working situation, the debug sections seems to be hit only
> once, at 10.028247. At this point, the tg3 is already reported as
> initialized - just like when it's hit the second and third time in the
> working situation.
>
> Bottom line is that commit 89665a6a71408796565bfd29cfa6a7877b17a667
> really makes a difference regarding the way the tg3 card is
> initialized, which seems to cause the problem.

Hi Nils,

Thanks a lot for the bug report.  Can you open a bugzilla at
http://bugzilla.kernel.org, put it in the drivers/PCI component, mark
it as a regression, and attach the complete dmesg log for both the
working and non-working cases, as well as "lspci -vv" output for the
working case?

I don't yet see how 89665a6a7140 makes a difference here.  We must
eventually read PCI_VENDOR_ID_BROADCOM (0x14e4) because the tg3 driver
claimed the device.

Can you still reproduce the problem if you print out the value of "l"
every time we read PCI_VENDOR_ID in pci_bus_read_dev_vendor_id()?
That will change the timing, so it's possible that will make it harder
to reproduce.

Bjorn


* tg3 broken in 3.18.0?
@ 2014-12-10 23:06 Nils Holland
  2014-12-11 16:45 ` Marcelo Ricardo Leitner
  0 siblings, 1 reply; 24+ messages in thread
From: Nils Holland @ 2014-12-10 23:06 UTC (permalink / raw)
  To: netdev

Hi everyone,

I just upgraded a machine from 3.17.3 to 3.18.0 and noticed that after
the upgrade, the machine's network interface (which is a tg3) would no
longer run correctly (or, for that matter, run at all). During the
upgrade, I didn't change any kernel config options or any other parts
of the system.

Since the machine is remote and I don't have direct access to it, it's
kind of hard currently to give more details, but here's what I'm
seeing in the logs:

[Booting 3.17.3:]
[    1.383151] tg3.c:v3.137 (May 11, 2014)
[    1.387296] libphy: tg3 mdio bus: probed
[    1.452600] tg3 0000:02:00.0 eth0:
        Tigon3 [partno(BCM57780) rev 57780001] (PCI Express) MAC address
        00:19:99:ce:13:a6
[    1.454660] tg3 0000:02:00.0 eth0:
        attached PHY driver [Broadcom BCM57780] (mii_bus:phy_addr=200:01)
[    1.456764] tg3 0000:02:00.0 eth0:
        RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[    1.458911] tg3 0000:02:00.0 eth0:
        dma_rwctrl[76180000] dma_mask[64-bit]
[...]
[    6.602608] tg3 0000:02:00.0
        enp2s0: renamed from eth0
[    9.865638] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
[    9.887584] IPv6:
        ADDRCONF(NETDEV_UP): enp2s0: link is not ready
[   10.469819] tg3 0000:02:00.0
        enp2s0: Link is down
[   12.477396] tg3 0000:02:00.0
        enp2s0: Link is up at 100 Mbps, full duplex
[   12.477404] tg3 0000:02:00.0
        enp2s0: Flow control is off for TX and off for RX

[Booting 3.18.0:]
[    2.192915] tg3.c:v3.137 (May 11, 2014)
[    2.196767] libphy: tg3 mdio bus: probed
[    2.256294] tg3 0000:02:00.0 eth0:
        Tigon3 [partno(BCM57780) rev 57780001] (PCI Express) MAC address
        00:19:99:ce:13:a6
[    2.258387] tg3 0000:02:00.0 eth0:
        attached PHY driver [Broadcom BCM57780] (mii_bus:phy_addr=200:01)
[    2.260530] tg3 0000:02:00.0 eth0:
        RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[    2.262679] tg3 0000:02:00.0 eth0:
        dma_rwctrl[76180000] dma_mask[64-bit]
[...]
[    7.431176] tg3 0000:02:00.0
        enp2s0: renamed from eth0
[   10.422839] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
[   12.530363] tg3 0000:02:00.0
        enp2s0: No firmware running

That's the last thing I find about the card in the logs, the machine
will then just sit there, working normally but being unreachable from
the network.

If I see things correctly, there were only two patches affecting tg3
between 3.17(.3) and 3.18:

2c7c9ea429ba30fe506747b7da110e2212d8fefa
a620a6bc1c94c22d6c312892be1e0ae171523125

The affected machine being, like I said, remote, I've not yet been
able to do more thorough tests. So I thought I'd report the issue and
see if someone else has also seen it already, or can test things with
a more easily accessible machine. Otherwise, I might start digging
deeper.

Greetings,
Nils


* Re: tg3 broken in 3.18.0?
  2014-12-10 23:06 Nils Holland
@ 2014-12-11 16:45 ` Marcelo Ricardo Leitner
  2014-12-12 14:50   ` Jonathan Bither
  0 siblings, 1 reply; 24+ messages in thread
From: Marcelo Ricardo Leitner @ 2014-12-11 16:45 UTC (permalink / raw)
  To: Nils Holland, netdev

On 10-12-2014 21:06, Nils Holland wrote:
> Hi everyone,
>
> I just upgraded a machine from 3.17.3 to 3.18.0 and noticed that after
> the upgrade, the machine's network interface (which is a tg3) would no
> longer run correctly (or, for that matter, run at all). During the
> upgrade, I didn't change any kernel config options or any other parts
> of the system.

Same thing here! Thanks for reporting this, Nils.

> Since the machine is remote and I don't have direct access to it, it's
> kind of hard currently to give more details, but here's what I'm
> seeing in the logs:

I have access to mine, thanks to a secondary NIC.

$ ethtool -i p1p1
driver: tg3
version: 3.137
firmware-version: 5722-v3.13
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

$ ethtool p1p1
Settings for p1p1:
         Supported ports: [ TP ]
         Supported link modes:   10baseT/Half 10baseT/Full
                                 100baseT/Half 100baseT/Full
                                 1000baseT/Half 1000baseT/Full
         Supported pause frame use: No
         Supports auto-negotiation: Yes
         Advertised link modes:  10baseT/Half 10baseT/Full
                                 100baseT/Half 100baseT/Full
                                 1000baseT/Half 1000baseT/Full
         Advertised pause frame use: Symmetric
         Advertised auto-negotiation: Yes
         Speed: Unknown!
         Duplex: Unknown! (255)
         Port: Twisted Pair
         PHYAD: 1
         Transceiver: internal
         Auto-negotiation: on
         MDI-X: Unknown

$ sudo ip link set p1p1 up
RTNETLINK answers: No such device

> If I see things correctly, there were only two patches affecting tg3
> between 3.17(.3) and 3.18:
>
> 2c7c9ea429ba30fe506747b7da110e2212d8fefa
> a620a6bc1c94c22d6c312892be1e0ae171523125

I'm running net-next, 395eea6ccf2b253f81b4718ffbcae67d36fe2e69.
So my diffs would be:
$ git log v3.17..origin/master --oneline -- drivers/net/ethernet/broadcom/tg3.c
892311f ethtool: Support for configurable RSS hash function
60b7379 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
a620a6b tg3: fix ring init when there are more TX than RX channels
3964835 tg3: use netdev_rss_key_fill() helper
2c7c9ea tg3: Add skb->xmit_more support

With all of these reverted, the issue persists.

If no one has a better idea, I'll try bisecting later.

   Marcelo


* Re: tg3 broken in 3.18.0?
  2014-12-11 16:45 ` Marcelo Ricardo Leitner
@ 2014-12-12 14:50   ` Jonathan Bither
  2014-12-12 20:31     ` Nils Holland
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Bither @ 2014-12-12 14:50 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Nils Holland, netdev

Not sure if it helps any, but tg3 works here after a 3.18 upgrade. I'd 
be happy to share any information if it would help you out.

[root@www ~]# uname -a
Linux localhost 3.18.0-1.el6.elrepo.i686 #1 SMP Mon Dec 8 10:55:34 EST 2014 i686 i686 i386 GNU/Linux
[root@www ~]# ethtool -i eth0
driver: tg3
version: 3.137
firmware-version: 5704-v3.36, ASFIPMIc v2.37
bus-info: 0000:02:03.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
[root@www ~]# ethtool eth0
Settings for eth0:
	Supported ports: [ TP ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Half 1000baseT/Full
	Supported pause frame use: No
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Half 1000baseT/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Link partner advertised link modes:  10baseT/Half 10baseT/Full
	                                     100baseT/Half 100baseT/Full
	                                     1000baseT/Full
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: Yes
	Speed: 1000Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 1
	Transceiver: internal
	Auto-negotiation: on
	MDI-X: on
	Supports Wake-on: g
	Wake-on: g
	Current message level: 0x000000ff (255)
			       drv probe link timer ifdown ifup rx_err tx_err
	Link detected: yes
[root@www ~]#



On 12/11/2014 11:45 AM, Marcelo Ricardo Leitner wrote:
> On 10-12-2014 21:06, Nils Holland wrote:
>> Hi everyone,
>>
>> I just upgraded a machine from 3.17.3 to 3.18.0 and noticed that after
>> the upgrade, the machine's network interface (which is a tg3) would no
>> longer run correctly (or, for that matter, run at all). During the
>> upgrade, I didn't change any kernel config options or any other parts
>> of the system.
>
> Same thing here! Thanks for reporting this, Nils.
>
>> Since the machine is remote and I don't have direct access to it, it's
>> kind of hard currently to give more details, but here's what I'm
>> seeing in the logs:
>
> I have access to mine, kudos to secondary NIC.
>
> $ ethtool -i p1p1
> driver: tg3
> version: 3.137
> firmware-version: 5722-v3.13
> bus-info: 0000:02:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: no
>
> $ ethtool p1p1
> Settings for p1p1:
>          Supported ports: [ TP ]
>          Supported link modes:   10baseT/Half 10baseT/Full
>                                  100baseT/Half 100baseT/Full
>                                  1000baseT/Half 1000baseT/Full
>          Supported pause frame use: No
>          Supports auto-negotiation: Yes
>          Advertised link modes:  10baseT/Half 10baseT/Full
>                                  100baseT/Half 100baseT/Full
>                                  1000baseT/Half 1000baseT/Full
>          Advertised pause frame use: Symmetric
>          Advertised auto-negotiation: Yes
>          Speed: Unknown!
>          Duplex: Unknown! (255)
>          Port: Twisted Pair
>          PHYAD: 1
>          Transceiver: internal
>          Auto-negotiation: on
>          MDI-X: Unknown
>
> $ sudo ip link set p1p1 up
> RTNETLINK answers: No such device
>
>> If I see things correctly, there were only two patches affecting tg3
>> between 3.17(.3) and 3.18:
>>
>> 2c7c9ea429ba30fe506747b7da110e2212d8fefa
>> a620a6bc1c94c22d6c312892be1e0ae171523125
>
> I'm running net-next, 395eea6ccf2b253f81b4718ffbcae67d36fe2e69.
> So my diffs would be:
> $ git log v3.17..origin/master --oneline --
> drivers/net/ethernet/broadcom/tg3.c
> 892311f ethtool: Support for configurable RSS hash function
> 60b7379 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> a620a6b tg3: fix ring init when there are more TX than RX channels
> 3964835 tg3: use netdev_rss_key_fill() helper
> 2c7c9ea tg3: Add skb->xmit_more support
>
> Reverting all these, issue continues.
>
> If no one has a better shot, I'll try bisecting later.
>
>    Marcelo


* Re: tg3 broken in 3.18.0?
  2014-12-12 14:50   ` Jonathan Bither
@ 2014-12-12 20:31     ` Nils Holland
  2014-12-13  1:14       ` [bisected] " Nils Holland
  0 siblings, 1 reply; 24+ messages in thread
From: Nils Holland @ 2014-12-12 20:31 UTC (permalink / raw)
  To: Jonathan Bither; +Cc: netdev

On Fri, Dec 12, 2014 at 09:50:53AM -0500, Jonathan Bither wrote:
> Not sure if it helps any, but tg3 works here after a 3.18 upgrade. I'd 
> be happy to share any information if it would help you out.

What I get here is this (output captured under 3.17.3):

triton513 ~ # ethtool -i enp2s0
driver: tg3
version: 3.137
firmware-version: sb
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

So you, Marcelo and I all seem to have different firmware versions.
As far as I know, some tg3 variants carry their firmware onboard while
others get it injected at driver initialization (correct me if I'm
wrong). Had Marcelo's firmware version matched mine, that might have
been a clue, but it doesn't.
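
To compare firmware versions across machines, the relevant field can be
pulled out of the ethtool output. A minimal sketch: fw_version is a
hypothetical helper name, and it assumes only the "key: value" layout
shown in the output above.

```shell
# Hypothetical helper: read `ethtool -i` output on stdin and print the
# value of the firmware-version field.
fw_version() {
    awk -F': ' '$1 == "firmware-version" { print $2 }'
}

# Example call (interface name as on my box):
#   ethtool -i enp2s0 | fw_version
```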

I'm putting my faith in Marcelo bisecting and finding out more details.
I might try that as well over the weekend, at least to the extent
possible without ever being able to have live access to the machine
when it is running a kernel exhibiting this issue.

Greetings,
Nils


* [bisected] tg3 broken in 3.18.0?
  2014-12-12 20:31     ` Nils Holland
@ 2014-12-13  1:14       ` Nils Holland
  2014-12-13  1:18         ` David Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Nils Holland @ 2014-12-13  1:14 UTC (permalink / raw)
  To: netdev

Ok folks,

I've now taken the time to bisect the issue that killed the tg3 network
interface on one of my boxes in 3.18.0. Besides me, at least one other
person was affected, although we also have a confirmed report of
another person using tg3 without issues under 3.18.0.

My bisect exercise suggests that the following commit is the culprit:

89665a6a71408796565bfd29cfa6a7877b17a667 (PCI: Check only the Vendor
ID to identify Configuration Request Retry)
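
For anyone who wants to repeat this, a bisect can be sketched as below.
This is not what I ran: it assumes a test that can be scripted, and
bisect_between is a hypothetical helper name. A driver regression that
needs a reboot per step has to be marked by hand with "git bisect good"
or "git bisect bad" instead of "git bisect run".

```shell
# Hypothetical helper: bisect between a known-good and a known-bad revision.
# The remaining arguments form a test command; exit status 0 marks a step
# as good, non-zero (other than 125) marks it as bad.
bisect_between() {
    good=$1; bad=$2; shift 2
    git bisect start "$bad" "$good" >/dev/null &&
    git bisect run "$@" >/dev/null
    # refs/bisect/bad names the first bad commit once the run has finished
    git rev-parse refs/bisect/bad
    git bisect reset >/dev/null 2>&1
}

# Example call (the test script name is a placeholder):
#   bisect_between v3.17 v3.18 ./check-tg3.sh
```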

In case that rings a bell for anyone, I'd be more than glad to hear
about it! Otherwise, while I'm no expert at this, I'll do some more
investigations tomorrow. It's gotten kind of late during bisecting and
I'm off for some sleep now. ;-)

Greetings,
Nils


* Re: [bisected] tg3 broken in 3.18.0?
  2014-12-13  1:14       ` [bisected] " Nils Holland
@ 2014-12-13  1:18         ` David Miller
  0 siblings, 0 replies; 24+ messages in thread
From: David Miller @ 2014-12-13  1:18 UTC (permalink / raw)
  To: nholland; +Cc: netdev

From: Nils Holland <nholland@tisys.org>
Date: Sat, 13 Dec 2014 02:14:08 +0100

> Ok folks,
> 
> I now took the time to bisect the issue that killed the tg3 network
> interface on one of my boxes in 3.18.0. Besides me, at least one other
> person was affected, although we also have a confirmed report of
> another person using tg3 without issues under 3.18.0.
> 
> My bisect exercise suggests that the following commit is the culprit:
> 
> 89665a6a71408796565bfd29cfa6a7877b17a667 (PCI: Check only the Vendor
> ID to identify Configuration Request Retry)
> 
> In case that rings a bell for anyone, I'd be more than glad to hear
> about it! Otherwise, while I'm no expert at this, I'll do some more
> investigations tomorrow. It's gotten kind of late during bisecting and
> I'm off for some sleep now. ;-)

You definitely need to bring this up with the author of that change
and the relevant list for the PCI subsystem and/or linux-kernel.


end of thread (newest message: 2014-12-19 19:37 UTC)

Thread overview: 24+ messages
2014-12-13 21:02 [bisected] tg3 broken in 3.18.0? Nils Holland
2014-12-15 15:06 ` Marcelo Ricardo Leitner
2014-12-16 16:04   ` Rajat Jain
2014-12-16 16:20     ` Bjorn Helgaas
2014-12-16 17:15       ` Michael Chan
2014-12-16 17:59         ` Marcelo Ricardo Leitner
2014-12-16 19:54           ` Michael Chan
2014-12-16 20:02             ` Marcelo Ricardo Leitner
2014-12-18 19:15             ` Bjorn Helgaas
2014-12-18 19:28               ` Prashant Sreedharan
2014-12-18 20:09                 ` Marcelo Ricardo Leitner
2014-12-18 20:33                   ` Marcelo Ricardo Leitner
2014-12-18 20:26                 ` Nils Holland
2014-12-19  2:10                   ` Prashant Sreedharan
2014-12-19 17:09                     ` Bjorn Helgaas
2014-12-19 17:16                       ` Marcelo Ricardo Leitner
2014-12-19 18:24                         ` Rajat Jain
2014-12-19 18:53                           ` Prashant Sreedharan
2014-12-19 19:37                             ` Rajat Jain
2014-12-16 18:00     ` Marcelo Ricardo Leitner
2014-12-16 20:38       ` Nils Holland
2014-12-16  0:31 ` Bjorn Helgaas
  -- strict thread matches above, loose matches on Subject: below --
2014-12-10 23:06 Nils Holland
2014-12-11 16:45 ` Marcelo Ricardo Leitner
2014-12-12 14:50   ` Jonathan Bither
2014-12-12 20:31     ` Nils Holland
2014-12-13  1:14       ` [bisected] " Nils Holland
2014-12-13  1:18         ` David Miller
