From: "Russell King (Oracle)" <linux@armlinux.org.uk>
To: Andrew Lunn <andrew@lunn.ch>
Cc: "Rafał Miłecki" <zajec5@gmail.com>,
"Network Development" <netdev@vger.kernel.org>,
"Heiner Kallweit" <hkallweit1@gmail.com>,
"Robert Marko" <robimarko@gmail.com>,
"Ansuel Smith" <ansuelsmth@gmail.com>,
"Daniel Golle" <daniel@makrotopia.org>
Subject: Re: Race in PHY subsystem? Attaching to PHY devices before they get probed
Date: Mon, 22 Jan 2024 16:56:18 +0000 [thread overview]
Message-ID: <Za6eMg0y2QxogfmD@shell.armlinux.org.uk> (raw)
In-Reply-To: <a9e79494-b94a-40f7-9c28-22b6220db5c2@lunn.ch>
On Mon, Jan 22, 2024 at 03:12:42PM +0100, Andrew Lunn wrote:
> On Mon, Jan 22, 2024 at 08:09:58AM +0100, Rafał Miłecki wrote:
> > Hi!
> >
> > I have MT7988 SoC board with following problem:
> > [ 26.887979] Aquantia AQR113C mdio-bus:08: aqr107_wait_reset_complete failed: -110
> >
> > This issue is known to occur when PHY's firmware is not running. After
> > some debugging I discovered that .config_init() CB gets called while
> > .probe() CB is still being executed.
> >
> > It turns out mtk_soc_eth.c calls phylink_of_phy_connect() before my PHY
> > gets fully probed and it seems there is nothing in PHY subsystem
> > verifying that. Please note this PHY takes quite some time to probe as
> > it involves sending firmware to hardware.
> >
> > Is that a possible race in PHY subsystem?
>
> Seems like it.
>
> There is a patch "net: phylib: get rid of unnecessary locking" which
> removed locks from probe, which might of helped, but the patch also
> says:
>
> The locking in phy_probe() and phy_remove() does very little to prevent
> any races with e.g. phy_attach_direct(),
>
> suggesting it probably did not help.
The reason for that statement is because phy_attach_direct() doesn't
take phydev->lock _at all_, so taking the lock in phy_probe() has no
effect on any race with phy_attach_direct().
> I think the traditional way problems like this are avoided is that the
> device should not be visible to the rest of the system until probe has
> completed.
However, we have the problem of the generic driver fallback - which
phy_attach_direct() does.
The probe vs phy_attach_direct() has been racy for quite a long time,
and the poking about that's done in that function such as assigning
struct device's driver member, calling device_bind_driver() etc is
all hellishly racy if the phy_device _could_ be bound simultaneously.
Also note this... we call device_bind_driver() from phy_attach_direct(),
and the documentation for this function states:
* This function must be called with the device lock held.
which we don't do. So we're already violating the locking requirements
for the driver model.
So, I would suggest that the solution is to make use of device_lock()
which will also only return once a probe has succeeded.
However, that's still not ideal - because the fact we have a race here
means that what could happen is that phy_attach_direct() is called
a little earlier than the probe begins, and the phy device ends up
being bound to the generic PHY driver rather than its specific driver.
I think what this comes down to are the following points:
1) not using the required device model locking
2) the mere existence of the default driver makes for a race between
the PHY being attached and its driver being probed.
No amount of locking saves us from (2) - the only solutions that I can
see to this are:
1) to put up with there being such a race
2) get rid of the default drivers altogether and insist that we have
specific PHY drivers for _all_ PHYs
3) have some kind of retry mechanism
A further problem is... we can't simply return -EPROBE_DEFER from
phy_attach_direct() because this function may not be called from
probe functions - it may be called from the .ndo_open method which
has no idea how to handle a probe deferal. Moreover, returning an
error to userspace will just cause it to fail (because all errors
from trying to bring a netdev up are considered to be fatal.)
So, it's a really yucky problem, and I don't see any nice and simple
solution.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
next prev parent reply other threads:[~2024-01-22 16:56 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-22 7:09 Race in PHY subsystem? Attaching to PHY devices before they get probed Rafał Miłecki
2024-01-22 9:48 ` Rafał Miłecki
2024-01-22 14:12 ` Andrew Lunn
2024-01-22 16:56 ` Russell King (Oracle) [this message]
2024-01-24 14:58 ` Christian Marangi
2024-01-24 17:52 ` Andrew Lunn
2024-01-24 19:00 ` Russell King (Oracle)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Za6eMg0y2QxogfmD@shell.armlinux.org.uk \
--to=linux@armlinux.org.uk \
--cc=andrew@lunn.ch \
--cc=ansuelsmth@gmail.com \
--cc=daniel@makrotopia.org \
--cc=hkallweit1@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=robimarko@gmail.com \
--cc=zajec5@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.