From: Sean Anderson <sean.anderson@linux.dev>
To: Andrew Lunn <andrew@lunn.ch>,
Russell King <linux@armlinux.org.uk>,
Andi Shyti <andi.shyti@kernel.org>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
linux-i2c@vger.kernel.org
Cc: Michal Simek <michal.simek@amd.com>,
Heiner Kallweit <hkallweit1@gmail.com>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: [BUG] SFP I2C timeout forces link down with PHY_ERROR
Date: Tue, 28 May 2024 12:57:25 -0400 [thread overview]
Message-ID: <ec7907f1-cb5a-41ab-824c-aa0b02440ada@linux.dev> (raw)
Hi,
I saw the following warning [1] twice when testing 1000Base-T SFP
modules:
[ 1481.682501] cdns-i2c ff030000.i2c: timeout waiting on completion
[ 1481.692010] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed
[ 1481.699910] ------------[ cut here ]------------
[ 1481.705459] phy_check_link_status+0x0/0xe8: returned: -67
[ 1481.711448] WARNING: CPU: 2 PID: 67 at drivers/net/phy/phy.c:1233 phy_state_machine+0xac/0x2ec
<snip>
[ 1481.904544] macb ff0c0000.ethernet net1: Link is Down
and a second time with some other errors too:
[ 64.972751] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted!
[ 64.979478] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted!
[ 65.998108] cdns-i2c ff030000.i2c: timeout waiting on completion
[ 66.010558] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed
[ 66.017856] ------------[ cut here ]------------
[ 66.022786] phy_check_link_status+0x0/0xcc: returned: -67
[ 66.028255] WARNING: CPU: 0 PID: 70 at drivers/net/phy/phy.c:1233 phy_state_machine+0xa4/0x2b8
<snip>
[ 66.339533] macb ff0c0000.ethernet net1: Link is Down
The chain of events is:
- The I2C transaction times out for some reason (in the latter case due
to a known hardware bug).
- mdio-i2c converts the error response to a 0xffff return value
- genphy_read_lpa sees that LPA_1000MSFAIL is set in MII_STAT1000 and
returns -ENOLINK. This propagates up the calls stack.
- phy_check_link_status returns -ENOLINK
- phy_error_precise forces the link down with state = PHY_ERROR.
The problem with this is that although the register read fails due to a
temporary condition, the link goes down permanently (or at least until
the admin cycles the interface state).
I think some part of the stack should implement a retry mechanism, but
I'm not sure which part. One idea could be to have mdio-i2c propagate
negative errors instead of converting them to successful reads of
0xffff. But we would still need to handle that in the phy driver or in
phy_state_machine.
- Are I2C bus drivers supposed to be flaky like this? That is, are callers of
i2c_transfer expected to handle the occasional spurious error?
- Similarly, are MDIO bus drivers allowed to be flaky?
- Is ETIMEDOUT even supposed to be recoverable? Maybe we should have
cdns-i2c return EAGAIN instead so it gets retried by the bus
arbitration logic in __i2c_transfer.
- ENOLINK really seems like something which we could recover from by
resetting the phy (or even just waiting a bit). Maybe we should have
the phy state machine just switch to PHY_NOLINK?
Of course, the best option would be to fix cdns-i2c to not be buggy, but
the hardware itself is buggy in at least one of the above cases so that
may not be practical.
--Sean
WARNING: multiple messages have this Message-ID (diff)
From: Sean Anderson <sean.anderson@linux.dev>
To: Andrew Lunn <andrew@lunn.ch>,
Russell King <linux@armlinux.org.uk>,
Andi Shyti <andi.shyti@kernel.org>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
linux-i2c@vger.kernel.org
Cc: Michal Simek <michal.simek@amd.com>,
Heiner Kallweit <hkallweit1@gmail.com>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: [BUG] SFP I2C timeout forces link down with PHY_ERROR
Date: Tue, 28 May 2024 12:57:25 -0400 [thread overview]
Message-ID: <ec7907f1-cb5a-41ab-824c-aa0b02440ada@linux.dev> (raw)
Hi,
I saw the following warning [1] twice when testing 1000Base-T SFP
modules:
[ 1481.682501] cdns-i2c ff030000.i2c: timeout waiting on completion
[ 1481.692010] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed
[ 1481.699910] ------------[ cut here ]------------
[ 1481.705459] phy_check_link_status+0x0/0xe8: returned: -67
[ 1481.711448] WARNING: CPU: 2 PID: 67 at drivers/net/phy/phy.c:1233 phy_state_machine+0xac/0x2ec
<snip>
[ 1481.904544] macb ff0c0000.ethernet net1: Link is Down
and a second time with some other errors too:
[ 64.972751] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted!
[ 64.979478] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted!
[ 65.998108] cdns-i2c ff030000.i2c: timeout waiting on completion
[ 66.010558] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed
[ 66.017856] ------------[ cut here ]------------
[ 66.022786] phy_check_link_status+0x0/0xcc: returned: -67
[ 66.028255] WARNING: CPU: 0 PID: 70 at drivers/net/phy/phy.c:1233 phy_state_machine+0xa4/0x2b8
<snip>
[ 66.339533] macb ff0c0000.ethernet net1: Link is Down
The chain of events is:
- The I2C transaction times out for some reason (in the latter case due
to a known hardware bug).
- mdio-i2c converts the error response to a 0xffff return value
- genphy_read_lpa sees that LPA_1000MSFAIL is set in MII_STAT1000 and
returns -ENOLINK. This propagates up the calls stack.
- phy_check_link_status returns -ENOLINK
- phy_error_precise forces the link down with state = PHY_ERROR.
The problem with this is that although the register read fails due to a
temporary condition, the link goes down permanently (or at least until
the admin cycles the interface state).
I think some part of the stack should implement a retry mechanism, but
I'm not sure which part. One idea could be to have mdio-i2c propagate
negative errors instead of converting them to successful reads of
0xffff. But we would still need to handle that in the phy driver or in
phy_state_machine.
- Are I2C bus drivers supposed to be flaky like this? That is, are callers of
i2c_transfer expected to handle the occasional spurious error?
- Similarly, are MDIO bus drivers allowed to be flaky?
- Is ETIMEDOUT even supposed to be recoverable? Maybe we should have
cdns-i2c return EAGAIN instead so it gets retried by the bus
arbitration logic in __i2c_transfer.
- ENOLINK really seems like something which we could recover from by
resetting the phy (or even just waiting a bit). Maybe we should have
the phy state machine just switch to PHY_NOLINK?
Of course, the best option would be to fix cdns-i2c to not be buggy, but
the hardware itself is buggy in at least one of the above cases so that
may not be practical.
--Sean
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next reply other threads:[~2024-05-28 16:57 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-28 16:57 Sean Anderson [this message]
2024-05-28 16:57 ` [BUG] SFP I2C timeout forces link down with PHY_ERROR Sean Anderson
2024-05-28 17:28 ` Russell King (Oracle)
2024-05-28 17:28 ` Russell King (Oracle)
2024-05-28 17:50 ` Sean Anderson
2024-05-28 17:50 ` Sean Anderson
2024-05-28 17:52 ` Sean Anderson
2024-05-28 17:52 ` Sean Anderson
2024-05-28 18:14 ` Andrew Lunn
2024-05-28 18:14 ` Andrew Lunn
2024-05-30 16:56 ` Sean Anderson
2024-05-30 16:56 ` Sean Anderson
2024-05-28 18:22 ` Russell King (Oracle)
2024-05-28 18:22 ` Russell King (Oracle)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ec7907f1-cb5a-41ab-824c-aa0b02440ada@linux.dev \
--to=sean.anderson@linux.dev \
--cc=andi.shyti@kernel.org \
--cc=andrew@lunn.ch \
--cc=hkallweit1@gmail.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-i2c@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux@armlinux.org.uk \
--cc=michal.simek@amd.com \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.