From: Zefir Kurtisi <zefir.kurtisi@neratec.com>
To: Florian Fainelli <f.fainelli@gmail.com>, netdev@vger.kernel.org
Cc: andrew@lunn.ch
Subject: Re: [PATCH] phy state machine: failsafe leave invalid RUNNING state
Date: Wed, 4 Jan 2017 17:10:14 +0100
Message-ID: <8521b51f-04f7-aeef-f862-bb150257cfa4@neratec.com>
In-Reply-To: <16a741c1-7005-b1df-f2e6-afdbe9d086c8@gmail.com>

On 01/04/2017 04:30 PM, Florian Fainelli wrote:
> 
> 
> On 01/04/2017 07:27 AM, Zefir Kurtisi wrote:
>> On 01/04/2017 04:13 PM, Florian Fainelli wrote:
>>>
>>>
>>> On 01/04/2017 07:04 AM, Zefir Kurtisi wrote:
>>>> While in RUNNING state, phy_state_machine() checks for link changes by
>>>> comparing phydev->link before and after calling phy_read_status().
>>>> This works as long as it is guaranteed that phydev->link is never
>>>> changed outside the phy_state_machine().
>>>>
>>>> If in some setups this happens, it causes the state machine to miss
>>>> a link loss and remain RUNNING despite phydev->link being 0.
>>>>
>>>> This has been observed running a dsa setup with a process continuously
>>>> polling the link states over ethtool each second (SNMPD RFC-1213
>>>> agent). Disconnecting the link on a phy followed by a ETHTOOL_GSET
>>>> causes dsa_slave_get_settings() / dsa_slave_get_link_ksettings() to
>>>> call phy_read_status() and with that modify the link status - and
>>>> with that bricking the phy state machine.
>>>
>>> That's the interesting part of the analysis, how does this brick the PHY
>>> state machine? Is the PHY driver changing the link status in the
>>> read_status callback that it implements?
>>>
>> phydev->read_status points to genphy_read_status(), where the first call goes to
>> genphy_update_link() which updates the link status.
>>
>> Thereafter phy_state_machine():RUNNING won't be able to detect the link loss
>> anymore unless the link state changes again.
>>
>>
>> I was trying to figure out if there is a rule that forbids changing phydev->link
>> from outside the state machine, but found several places where it happens (either
>> directly, or over genphy_read_status() or over genphy_update_link()).
>>
>> Curious how this did not show up before, since within the dsa setup it is very
>> easy to trigger:
>> a) physically disconnect link
>> b) within one second run ethtool ethX
> 
> You need to be more specific here about what "the dsa setup" is, drivers
> involved, which ports of the switch you are seeing this with (user
> facing, CPU port, DSA port?) etc.
> 
I am working on top of LEDE and therefore on kernel 4.4.21. However, I checked
the related source files and believe the effect should be reproducible with HEAD.

The setup is as follows:
mv88e6321:
* ports 0+1 connected to fibre-optics transceivers at fixed 100 Mbps
* port 4 is CPU port
* custom phy driver (replacement for marvell.ko) only populated with
  * .config_init to
    * set fixed speed for ports 0+1 (when in FO mode)
    * run genphy_config_init() for all other modes (here: CPU port)
  * .config_aneg=genphy_config_aneg, .read_status=genphy_read_status


To my understanding, the exact setup is irrelevant. To reproduce the issue it is
enough to have a means of running genphy_update_link() (as done in e.g.
mediatek/mtk_eth_soc.c, dsa/slave.c), genphy_read_status() (as done in e.g.
hisilicon/hns/hns_enet.c), or phy_read_status() (as done in e.g.
ethernet/ti/netcp_ethss.c, ethernet/aeroflex/greth.c, etc.). In the drivers I
looked at, these calls mostly sit in the ETHTOOL_GSET execution path.

Once a link loss is picked up outside the phy state machine, the state machine
stays stuck in an invalid RUNNING state. To prevent that, to my understanding,
either upper layer drivers (Ethernet, dsa) must never modify the link state
(which includes calling the functions noted above), or we need the proposed
fail-safe mechanism to avoid getting stuck.
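
To illustrate the sequence, here is a minimal user-space sketch of the effect.
It is not kernel code and not the patch itself: all model_* names are
hypothetical stand-ins, model_read_status() only mimics the way
phy_read_status() refreshes phydev->link, and the fail-safe branch only shows
the idea of leaving RUNNING when the cached link is 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins for the kernel's PHY types. */
enum model_state { MODEL_RUNNING, MODEL_CHANGELINK };

struct model_phy {
	bool hw_link;           /* what the hardware would report */
	bool link;              /* cached copy, plays the role of phydev->link */
	enum model_state state;
};

/* Stand-in for phy_read_status(): refresh the cached link from "hardware". */
void model_read_status(struct model_phy *phy)
{
	phy->link = phy->hw_link;
}

/* One poll cycle in RUNNING: only a change across this call is noticed. */
void model_poll_running(struct model_phy *phy, bool failsafe)
{
	bool old_link;

	if (phy->state != MODEL_RUNNING)
		return;

	old_link = phy->link;
	model_read_status(phy);
	if (old_link != phy->link)
		phy->state = MODEL_CHANGELINK;

	/* Fail-safe idea: RUNNING without link is invalid, so leave it. */
	if (failsafe && !phy->link && phy->state == MODEL_RUNNING)
		phy->state = MODEL_CHANGELINK;
}

/* Scenario from this report: unplug, outside read, then one poll cycle. */
enum model_state model_scenario(bool failsafe)
{
	struct model_phy phy = {
		.hw_link = true, .link = true, .state = MODEL_RUNNING,
	};

	phy.hw_link = false;       /* a) link is physically disconnected */
	model_read_status(&phy);   /* b) ethtool path refreshes the cached link */
	model_poll_running(&phy, failsafe);
	return phy.state;
}

/* Without the fail-safe the model stays RUNNING, with it it moves on. */
void model_check_scenarios(void)
{
	assert(model_scenario(false) == MODEL_RUNNING);
	assert(model_scenario(true) == MODEL_CHANGELINK);
}
```

Calling model_check_scenarios() from a small test program exercises both
cases. In the model, the poll cycle sees old_link == link == 0 after the
outside read, so the comparison alone never fires.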


Thanks,
Zefir

Thread overview: 16+ messages
2017-01-04 15:04 [PATCH] phy state machine: failsafe leave invalid RUNNING state Zefir Kurtisi
2017-01-04 15:13 ` Florian Fainelli
2017-01-04 15:27   ` Zefir Kurtisi
2017-01-04 15:30     ` Florian Fainelli
2017-01-04 16:10       ` Zefir Kurtisi [this message]
2017-01-04 16:16         ` Andrew Lunn
2017-01-04 16:24           ` [SIDE DISCUSSION] " Matthias May
2017-01-04 21:44         ` Florian Fainelli
2017-01-05  9:23           ` Zefir Kurtisi
2017-01-05 19:39             ` Florian Fainelli
2017-01-04 20:07 ` kbuild test robot
2017-01-04 20:23 ` kbuild test robot
2017-01-06 11:14 ` [PATCH v2] " Zefir Kurtisi
2017-01-08 23:16   ` David Miller
2017-01-09  1:24   ` Florian Fainelli
2017-01-09 20:38   ` David Miller
