netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Russell King (Oracle)" <linux@armlinux.org.uk>
To: Andrew Lunn <andrew@lunn.ch>
Cc: "Rafał Miłecki" <zajec5@gmail.com>,
	"Network Development" <netdev@vger.kernel.org>,
	"Heiner Kallweit" <hkallweit1@gmail.com>,
	"Robert Marko" <robimarko@gmail.com>,
	"Ansuel Smith" <ansuelsmth@gmail.com>,
	"Daniel Golle" <daniel@makrotopia.org>
Subject: Re: Race in PHY subsystem? Attaching to PHY devices before they get probed
Date: Mon, 22 Jan 2024 16:56:18 +0000	[thread overview]
Message-ID: <Za6eMg0y2QxogfmD@shell.armlinux.org.uk> (raw)
In-Reply-To: <a9e79494-b94a-40f7-9c28-22b6220db5c2@lunn.ch>

On Mon, Jan 22, 2024 at 03:12:42PM +0100, Andrew Lunn wrote:
> On Mon, Jan 22, 2024 at 08:09:58AM +0100, Rafał Miłecki wrote:
> > Hi!
> > 
> > I have MT7988 SoC board with following problem:
> > [   26.887979] Aquantia AQR113C mdio-bus:08: aqr107_wait_reset_complete failed: -110
> > 
> > This issue is known to occur when PHY's firmware is not running. After
> > some debugging I discovered that .config_init() CB gets called while
> > .probe() CB is still being executed.
> > 
> > It turns out mtk_soc_eth.c calls phylink_of_phy_connect() before my PHY
> > gets fully probed and it seems there is nothing in PHY subsystem
> > verifying that. Please note this PHY takes quite some time to probe as
> > it involves sending firmware to hardware.
> > 
> > Is that a possible race in PHY subsystem?
> 
> Seems like it.
> 
> There is a patch "net: phylib: get rid of unnecessary locking" which
> removed locks from probe, which might of helped, but the patch also
> says:
> 
>     The locking in phy_probe() and phy_remove() does very little to prevent
>     any races with e.g. phy_attach_direct(),
> 
> suggesting it probably did not help.

The reason for that statement is because phy_attach_direct() doesn't
take phydev->lock _at all_, so taking the lock in phy_probe() has no
effect on any race with phy_attach_direct().

> I think the traditional way problems like this are avoided is that the
> device should not be visible to the rest of the system until probe has
> completed.

However, we have the problem of the generic driver fallback - which
phy_attach_direct() does.

The probe vs phy_attach_direct() has been racy for quite a long time,
and the poking about that's done in that function such as assigning
struct device's driver member, calling device_bind_driver() etc is
all hellishly racy if the phy_device _could_ be bound simultaneously.

Also note this... we call device_bind_driver() from phy_attach_direct(),
and the documentation for this function states:

 * This function must be called with the device lock held.

which we don't do. So we're already violating the locking requirements
for the driver model.

So, I would suggest that the solution is to make use of device_lock()
which will also only return once a probe has succeeded.

However, that's still not ideal - because the fact we have a race here
means that what could happen is that phy_attach_direct() is called
a little earlier than the probe begins, and the phy device ends up
being bound to the generic PHY driver rather than its specific driver.

I think what this comes down to are the following points:

1) not using the required device model locking
2) the mere existence of the default driver makes for a race between
   the PHY being attached and its driver being probed.

No amount of locking saves us from (2) - the only solutions that I can
see to this are:
1) to put up with there being such a race
2) get rid of the default drivers altogether and insist that we have
   specific PHY drivers for _all_ PHYs
3) have some kind of retry mechanism

A further problem is... we can't simply return -EPROBE_DEFER from
phy_attach_direct() because this function may not be called from
probe functions - it may be called from the .ndo_open method which
has no idea how to handle a probe deferal. Moreover, returning an
error to userspace will just cause it to fail (because all errors
from trying to bring a netdev up are considered to be fatal.)

So, it's a really yucky problem, and I don't see any nice and simple
solution.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

  reply	other threads:[~2024-01-22 16:56 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-22  7:09 Race in PHY subsystem? Attaching to PHY devices before they get probed Rafał Miłecki
2024-01-22  9:48 ` Rafał Miłecki
2024-01-22 14:12 ` Andrew Lunn
2024-01-22 16:56   ` Russell King (Oracle) [this message]
2024-01-24 14:58     ` Christian Marangi
2024-01-24 17:52       ` Andrew Lunn
2024-01-24 19:00         ` Russell King (Oracle)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Za6eMg0y2QxogfmD@shell.armlinux.org.uk \
    --to=linux@armlinux.org.uk \
    --cc=andrew@lunn.ch \
    --cc=ansuelsmth@gmail.com \
    --cc=daniel@makrotopia.org \
    --cc=hkallweit1@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=robimarko@gmail.com \
    --cc=zajec5@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).