linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
From: Sean Anderson <sean.anderson@linux.dev>
To: Andrew Lunn <andrew@lunn.ch>,
	Russell King <linux@armlinux.org.uk>,
	Andi Shyti <andi.shyti@kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	linux-i2c@vger.kernel.org
Cc: Michal Simek <michal.simek@amd.com>,
	Heiner Kallweit <hkallweit1@gmail.com>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: [BUG] SFP I2C timeout forces link down with PHY_ERROR
Date: Tue, 28 May 2024 12:57:25 -0400	[thread overview]
Message-ID: <ec7907f1-cb5a-41ab-824c-aa0b02440ada@linux.dev> (raw)

Hi,

I saw the following warning [1] twice when testing 1000Base-T SFP
modules:

[ 1481.682501] cdns-i2c ff030000.i2c: timeout waiting on completion
[ 1481.692010] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed
[ 1481.699910] ------------[ cut here ]------------
[ 1481.705459] phy_check_link_status+0x0/0xe8: returned: -67
[ 1481.711448] WARNING: CPU: 2 PID: 67 at drivers/net/phy/phy.c:1233 phy_state_machine+0xac/0x2ec
<snip>
[ 1481.904544] macb ff0c0000.ethernet net1: Link is Down

and a second time with some other errors too:

[   64.972751] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted!
[   64.979478] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted!
[   65.998108] cdns-i2c ff030000.i2c: timeout waiting on completion
[   66.010558] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed
[   66.017856] ------------[ cut here ]------------
[   66.022786] phy_check_link_status+0x0/0xcc: returned: -67
[   66.028255] WARNING: CPU: 0 PID: 70 at drivers/net/phy/phy.c:1233 phy_state_machine+0xa4/0x2b8
<snip>
[   66.339533] macb ff0c0000.ethernet net1: Link is Down

The chain of events is:

- The I2C transaction times out for some reason (in the latter case due
  to a known hardware bug).
- mdio-i2c converts the error response to a 0xffff return value
- genphy_read_lpa sees that LPA_1000MSFAIL is set in MII_STAT1000 and
  returns -ENOLINK. This propagates up the calls stack.
- phy_check_link_status returns -ENOLINK
- phy_error_precise forces the link down with state = PHY_ERROR.

The problem with this is that although the register read fails due to a
temporary condition, the link goes down permanently (or at least until
the admin cycles the interface state).

I think some part of the stack should implement a retry mechanism, but
I'm not sure which part. One idea could be to have mdio-i2c propagate
negative errors instead of converting them to successful reads of
0xffff. But we would still need to handle that in the phy driver or in
phy_state_machine.

- Are I2C bus drivers supposed to be flaky like this? That is, are callers of
  i2c_transfer expected to handle the occasional spurious error?
- Similarly, are MDIO bus drivers allowed to be flaky?
- Is ETIMEDOUT even supposed to be recoverable? Maybe we should have
  cdns-i2c return EAGAIN instead so it gets retried by the bus
  arbitration logic in __i2c_transfer.
- ENOLINK really seems like something which we could recover from by
  resetting the phy (or even just waiting a bit). Maybe we should have
  the phy state machine just switch to PHY_NOLINK?

Of course, the best option would be to fix cdns-i2c to not be buggy, but
the hardware itself is buggy in at least one of the above cases so that
may not be practical.

--Sean

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

             reply	other threads:[~2024-05-28 16:57 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-28 16:57 Sean Anderson [this message]
2024-05-28 17:28 ` [BUG] SFP I2C timeout forces link down with PHY_ERROR Russell King (Oracle)
2024-05-28 17:50   ` Sean Anderson
2024-05-28 17:52     ` Sean Anderson
2024-05-28 18:14       ` Andrew Lunn
2024-05-30 16:56         ` Sean Anderson
2024-05-28 18:22     ` Russell King (Oracle)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ec7907f1-cb5a-41ab-824c-aa0b02440ada@linux.dev \
    --to=sean.anderson@linux.dev \
    --cc=andi.shyti@kernel.org \
    --cc=andrew@lunn.ch \
    --cc=hkallweit1@gmail.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-i2c@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@armlinux.org.uk \
    --cc=michal.simek@amd.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).