From: Sean Anderson <sean.anderson@linux.dev>
To: Andrew Lunn <andrew@lunn.ch>
Cc: "Russell King (Oracle)" <linux@armlinux.org.uk>,
Alex Williams <alex.williams@ni.com>,
Andi Shyti <andi.shyti@kernel.org>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
linux-i2c@vger.kernel.org, Michal Simek <michal.simek@amd.com>,
Heiner Kallweit <hkallweit1@gmail.com>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [BUG] SFP I2C timeout forces link down with PHY_ERROR
Date: Thu, 30 May 2024 12:56:18 -0400
Message-ID: <93e8839d-e712-4708-a2ca-df81051b8360@linux.dev>
In-Reply-To: <1398a492-95aa-46d9-b52b-a374fd6e9e77@lunn.ch>

On 5/28/24 14:14, Andrew Lunn wrote:
> On Tue, May 28, 2024 at 01:52:56PM -0400, Sean Anderson wrote:
>> (forgot to CC Alex)
>>
>> On 5/28/24 13:50, Sean Anderson wrote:
>> > On 5/28/24 13:28, Russell King (Oracle) wrote:
>> >> First, note that phylib's policy is if it loses comms with the PHY,
>> >> then the link will be forced down. This is out of control of the SFP
>> >> or phylink code.
>> >>
>> >> I've seen bugs with the I2C emulation on some modules resulting in
>> >> problems with various I2C controllers.
>> >>
>> >> Sometimes the problem is due to a bad I2C level shifter. Some I2C
>> >> level shifter manufacturers will swear blind that their shifter
>> >> doesn't lock up, but strangely, one can prove with an oscilloscope
>> >> that it _does_ lock up - and in a way that the only way to recover
>> >> was to possibly unplug the module or power cycle the platform.
>> >
>> > Well, I haven't seen any case where the bus locks up. I've been able to
>> > recover just by doing
>> >
>> > ip link set net0 down
>> > ip link set net0 up
>> >
>> > which suggests that this is just a transient problem.
>
> If you look back over the history, I don't think you will find any
> reports of transient problems with real MDIO busses. Hence any error
> is considered fatal. Also, when you consider the design of MDIO, it is
> actually very hard for an error to be detected. It is basically a
> shift register, shifting out 64 bits for a write, or 48 bits for a
> read, followed by receiving 16 bits for a read. There is no protocol
> to indicate any sort of error. If there is no device at the address,
> the pullup means you receive 1s. End of story.

Yes, I would expect the only time there could be transient problems
would be with external MII (such as if someone jiggled the phy).
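
To make the shift-register description above concrete, here is a minimal sketch of a Clause 22 MDIO write frame and of what a read returns when nothing answers. The field layout (32-bit preamble, ST, OP, 5-bit PHY address, 5-bit register address, turnaround, 16 data bits) follows IEEE 802.3 Clause 22; the function names and the dict-based bus model are hypothetical and exist only for illustration.

```python
# Clause 22 MDIO framing, modeled as plain integers (illustrative only).
PREAMBLE = (1 << 32) - 1   # 32 one-bits
ST_CLAUSE22 = 0b01         # start-of-frame
OP_WRITE = 0b01
OP_READ = 0b10
TA_WRITE = 0b10            # turnaround bits driven by the host on a write


def c22_write_frame(phy_addr, reg_addr, data):
    """Return the 64 bits the host shifts out for a write, MSB first."""
    if not 0 <= phy_addr < 32 or not 0 <= reg_addr < 32:
        raise ValueError("PHY and register addresses are 5 bits each")
    if not 0 <= data <= 0xFFFF:
        raise ValueError("data is 16 bits")
    frame = PREAMBLE
    frame = (frame << 2) | ST_CLAUSE22
    frame = (frame << 2) | OP_WRITE
    frame = (frame << 5) | phy_addr
    frame = (frame << 5) | reg_addr
    frame = (frame << 2) | TA_WRITE
    frame = (frame << 16) | data
    return frame


def c22_read(devices, phy_addr, reg_addr):
    """Hypothetical bus model: devices maps PHY address -> {register: value}.

    With no device at the address, nothing drives the data line and the
    pull-up makes every sampled data bit a 1, so the host sees 0xFFFF.
    """
    regs = devices.get(phy_addr)
    if regs is None:
        return 0xFFFF
    return regs[reg_addr]
```

The read result carries no status of its own: 0xFFFF from an absent device looks the same to the caller as a register that really holds 0xFFFF.
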
> With MDIO over I2C, it is I2C which has problems, not MDIO. Do you
> expect transient problems with I2C?

Well, I2C is known to have devices which can get stuck and hang the bus
(generally requiring some bit-banging from Linux to get things unstuck,
or a reset of the device). So while I2C (like MDIO) is supposed to be
completely reliable, there is a history of it being not quite perfect.
That said, I did not expect to see these kinds of errors at all. I'll
have a closer look at the controller driver when I have the time. Maybe
there is some errata for this...

> I would also point out that MDIO is not idempotent. Reading an
> interrupt status register often clears it. Reading the link status
> clears the latched link status. If you need to retry the read of the
> interrupt status register, you cannot, the interrupt has been cleared,
> you have lost it, and probably your hardware no longer works because
> you don't know what interrupt to handle.... If you need to re-read the
> link status, you have lost the latched version, and you have missed a
> an up or down event.
Yes. Same thing with I2C.
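
The point about non-idempotent reads can be sketched with a toy model of a latching-low link status bit. The class and function names are hypothetical, and the sketch assumes the failure happens after the device has already seen the read (so the latch is cleared even though the host never gets the value). BMSR_LSTATUS is the link status bit (0x0004) of the basic mode status register.

```python
BMSR_LSTATUS = 0x0004  # link status bit in the basic mode status register


class LatchedLinkStatus:
    """Toy PHY: link status latches low until it is read once."""

    def __init__(self):
        self.link_up = True
        self.latched_down = False

    def set_link(self, up):
        self.link_up = up
        if not up:
            self.latched_down = True  # remember that the link dropped

    def read_bmsr(self):
        if self.latched_down:
            value = 0                 # report the latched link-down event
        else:
            value = BMSR_LSTATUS if self.link_up else 0
        self.latched_down = False     # the read itself clears the latch
        return value


def read_with_retry(phy, reply_lost):
    """Read the status; if the first reply is lost in transit, read again."""
    value = phy.read_bmsr()           # the device sees this read either way
    if reply_lost:
        value = phy.read_bmsr()       # retry: the latch is already cleared
    return value
```

If the link drops and recovers before the poll, a clean read returns 0 and the event is seen. With a lost reply and a retry, the second read returns BMSR_LSTATUS and the link-down event is gone.
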
>> >> My advice would be to investigate the hardware in the first instance.
>
> I agree with Russell. Figure out why I2C is flaky. Since this is an
> SFP it may be something as trivial as the contacts need cleaning. Or
> the resistors are wrong, or you have a cheap module which is out of
> spec.

OK, I'll try to dig into this a little more...

--Sean