From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-oo1-f41.google.com (mail-oo1-f41.google.com [209.85.161.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B0AF333A6F6 for ; Tue, 24 Feb 2026 22:05:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.161.41 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1771970757; cv=none; b=cJ2qMf3mUwuzJ/mqm29McnkqakPPdSRyIx7FWMxuDwI7HV61eLjFueDh+pvVSz4FQL0NZE+T5ugDG0hGw1snuzWaA5H/s7o5Jv/IItOwfGcJok3aPDDr1IJ/NSBMcnyU40UsKJS8ARN0v1eMbkc8lUTCtOL3V2D9xfdhnNQFAPI= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1771970757; c=relaxed/simple; bh=5BQrVr3OxgnznwGKDnAbjzEw/ILtBXMAsJChxTDqWNg=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=Q4Zk5dqTvQoiSwFuWqheaiUjm5aSInS4fJAT7SxuhU6Gfh/FZ9LTXey+8oq4U564bhNUZv9S4LJOE3Y9AJYvPvFKDjYt8GVngb9yu7M/xjykr9+3ozOk80iVKPbZY30wtgOoEgqJnxt07Y5ZT2JZZl10MNelQ4iaU+94npBTD4Q= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=bsZCJG7s; arc=none smtp.client-ip=209.85.161.41 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="bsZCJG7s" Received: by mail-oo1-f41.google.com with SMTP id 006d021491bc7-679e5688b25so479604eaf.1 for ; Tue, 24 Feb 2026 14:05:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1771970754; x=1772575554; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=TaG8hwC2DP9/pQRfQce4RRjKTObpfeEkNQAyxe91JQY=; b=bsZCJG7sXvjpqE8GALz6faawK0ywt4zcGR/d0LDPBhQ0rBbqNItKWsje13BcKl/sfV pQufsGPxivVLZm+IPBTFXVLUVaYokBZ8DoO4prkPKrTiI2vl4AXp3JwSSkI271oZLnFG ZMy6ZCUgIKJTfi+tfoKyrWtPIOM5Y5Ovlm/0nshOjlwB7puD03+MFzCiXOxO7AVkH51S 6mOIlRyq71DgEWlzrlYo5YZqIhkQm8u6AW7ZIpxJGuLWeR7OtA09/vO//NKf96ZUvaID iD0UUIWTlGHkR0kGpc6Eb9niyEM3d7Z1QZpK0d9KPLJEiX9AyB/5SUqh7uf5SDg8oyVN CbOA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1771970754; x=1772575554; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-gg:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=TaG8hwC2DP9/pQRfQce4RRjKTObpfeEkNQAyxe91JQY=; b=CKIKyOlQ982RZ9NJ0C+cv/VhMMDAuChBh/FoSFDHRSIgagKtBC22QzW0iuu7wAxaYe TeptE2YkC0sFB+G/dHLP6Sd32kUHqOa7gFHWF1Q+epADK/AKQ4FRM1YaMikG+w25ejGn XBKUruEAvehv4wOcAGBQJ1zrg+B3lDffwDaoyNqd4nH6tRBTLCmnaH/kdUvrb33YCA1T NuLeNsASWgUbmoNt/TFaNBo9CunPklwNUkqmzYAYg9lezX+54Ux0OW4FQzmRupDoxFPx S0K9IcIKffSu0lq72Ed9bgB5TJDOR9VYyaRo/NDXxEp1HP/WnfQhiFq3otNA+NGaf5jX RVHA== X-Forwarded-Encrypted: i=1; AJvYcCXQrf8ybWz4J6YozhYX7oSV925M2XLLaSNvDX4xNHaTKWZ4RmrIh0hWqp0Tmu1zMDtABpSzqaXfbXdPxgE=@vger.kernel.org X-Gm-Message-State: AOJu0Yw2uo32qsOi2cEGCF/vJmeZUIoVWY40bSCWlzZ7OiI7+/1DDsQf 1LWgoN+UpizfK7ybFhGlqFOsTSxYe7clVTQg4WSzAxNZMw/M65bUTNcPLYU4iQ== X-Gm-Gg: AZuq6aIVT86siK7usirwjT6Wf/Fv7RMAgsr3oPFoTyB8xUe8h4vzW2p+ou/PALQqPwW 3sImGqDsWeh8iSBQlG49DDldRDGU/VJP+EfvALmiV411+5PqCCf63x/TzD73/bwZSb7G2E7IAdT A1gtxIA1+WldjTmp8/P2BjeTMKiTuvEQSRTydD7UqBZAdmEG4xb171UW+1ezoFyvgzy/erp1lNc 9H77uGAlfquNOLuhyovrj1ryQOG/xINqIfaVT82D43vqJW3bO5SrODrbeEA+8Lt+RTISU85yX9j LrECJLFqn7YUHThglajejLPXPSC8yV7Zv7gNjAv1yHoHYYJ7+pjjDk78MfTKxNWML6GT+6vMfXC fzQ6znb6muYsT5O6NTnPnVkPIItP4MBiJWnJ2CylU3rJXKmakiEBe9JmpuTTIA8WKmSiWlQ0rul fG0EFIjRBM1ilxP84IXtLFF9PT4jraXvhAM24hL6AZCQzEcSPYgW0NjR/XFZ1dKldg X-Received: by 2002:a05:7300:691f:b0:2b7:4157:9afc with SMTP id 5a478bee46e88-2bd7bd23dfdmr5324137eec.19.1771965531786; Tue, 24 Feb 2026 12:38:51 -0800 (PST) Received: from google.com ([2a00:79e0:2ebe:8:e1d3:e70f:c61b:7501]) by smtp.gmail.com with ESMTPSA id 5a478bee46e88-2bd7dbe78b9sm7329307eec.18.2026.02.24.12.38.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 24 Feb 2026 12:38:51 -0800 (PST) Date: Tue, 24 Feb 2026 12:38:48 -0800 From: Dmitry Torokhov To: Vladimir Oltean Cc: Vinod Koul , Kishon Vijay Abraham I , Neil Armstrong , "Rafael J. Wysocki" , Geert Uytterhoeven , Johan Hovold , Claudiu Beznea , "Dr. David Alan Gilbert" , Peter Griffin , Dmitry Baryshkov , Krzysztof Kozlowski , Zijun Hu , linux-phy@lists.infradead.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate() Message-ID: References: <20260223231500.zeffezslctqamhp7@skbuf> <20260224154730.qqnomchkdpxnyf4x@skbuf> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20260224154730.qqnomchkdpxnyf4x@skbuf> On Tue, Feb 24, 2026 at 05:47:30PM +0200, Vladimir Oltean wrote: > On Mon, Feb 23, 2026 at 04:37:39PM -0800, Dmitry Torokhov wrote: > > > You have hit on a weak spot of the Generic PHY framework and I would > > > like to encourage you to follow through with patches for this finding > > > (although it won't be exactly trivial, I think it's doable with some > > > determination). > > > > Thank you for the detailed response. I am not sure how much time I can > > spend on phy rework as this change was an essentially a drive-by. I only > > happen to look there because I want to remove > > class_find_device_by_of_node() in favor of class_find_device_by_fwnode(). > > OK, I believe you can still do that without any dependency. Right, this is independent, I just happened to look at that function. > > > > On your direct question: > > > It will impact downstream in subtle and unpleasant ways if you change > > > the of_xlate() semantics w.r.t. reference counting, but the code will > > > still compile just as before, i.e. when looking just at your downstream > > > driver (what 99% of developers do) there will be no obvious way of > > > knowing that something in the API has changed. I would avoid doing that. > > > > Hmm, I was not aware that we needed to take particular measures for > > benefit of downstream trees... > > I think you're entirely missing the point. > > Obviously it benefits the maintainer/reviewer first and foremost. If you > don't make the API change unmissable, drivers expecting the old convention > will eventually creep their way into the mainline kernel. Why put avoidable > pressure on reviewers watching out for things like this for years to come, > when you can force downstream to notice and adapt (OR not make the > change in the first place). Note that downstream means "API consumers > invisible to you, the API changer", not only "hopelessly un-upstreamable > drivers". Well, this is whole "stable kernel API" argument. Rules of lifetime around objects may change and the kernel needs to adapt. Anyway, this is your subsystem so you get to decide. > > > > The patch above leaves a few loose ends. > > > > > > The most obvious is of_phy_simple_xlate(), which has that put_device() > > > to balance out class_find_device_by_of_node() - which bumps the device > > > refcount to 2. It "looks" wrong but it is consistent with vendor > > > implementations of of_xlate(), which also provide a phy->dev refcount of 1. > > > And as explained, the refcount never _actually_ drops to 0 with the > > > above patch. > > > > > > I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake. > > > Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I > > > would offer a new helper, devm_of_phy_simple_provider_register(), which > > > would internally use a callback that doesn't drop the refcount (say > > > __of_phy_simple_xlate() for lack of imagination). I would convert vendor > > > PHY drivers one by one to use this (it would be valid at any time to use > > > either the old or the new method). > > > > > > You'd notice that most of the of_phy_simple_xlate() occurrences go away, > > > except for freescale/phy-fsl-lynx-28g.c which calls this function > > > directly from its own lynx_28g_xlate(). It will still have to keep > > > calling it, and for refcount equalization purposes that weird > > > put_device() will have to continue to exist. Just leave a comment as to > > > why that is. > > > > I this case I would simply add a comment telling why the reference > > should (and can) be dropped where it is being dropped in > > of_phy_simple_xlate() and call it a day. There is not much benefit in > > adding another helper. > > OK. > > > > But there are more important loose ends still. > > > > > > I mentioned that "zombie" device. We've solved the memory safety issue, > > > but it's possible for consumers to hold onto a phy whose provider has > > > disappeared. The refcount of &phy->dev hasn't dropped to 0, so > > > technically it's a valid object, but from PHY API perspective, it's > > > still possible to call phy_init(), phy_power_on() and friends on this > > > PHY, and the Generic PHY core will be happy to further call into the > > > phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound, > > > so it should really be left alone. > > > > > > If we fix the UAF but leave the zombie PHY problem, we've effectively > > > done nothing but silence static analysis checkers, while the code path > > > where the PHY provider unbinds effectively remains treated as poorly as > > > before, just moving the crashes to a different place. > > > > > > I suspect what needs to be done here is to introduce a "bool dead" or > > > similar, which is to be set from phy_destroy() and checked from every > > > API call. The idea is that API functions on zombie PHYs should fail > > > without calling into their driver. I **suppose** that > > > try_module_get(phy->ops->owner) "tried" to avoid this situation, but > > > it just protects against module removal, not against "echo device > > > > /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable. > > > > I think this is a problem common to many kernel subsystems, where there > > are devices that are vital for other devices to function, but are not > > parents of said devices. Think about regulators or clocks and such. > > In many such cases even validating at API level is not sufficient, > > because if you enable a clock you typically keep it running at least for > > a while, if not for entire duration of device being bound to a driver. > > If that clock goes away the device will break even though you are not > > calling any clock APIs at that particular time. > > > > We need some kind of revocation mechanism to signal consumers that > > providers they rely upon are going away. > > Yeah, this gave me pause. > > When you unbind a Generic PHY you can kind of expect that the data path > it affects to not work. But you'd sort of expect it to start working > again when you rebind the PHY driver. > > That will not be the case, though. > > The problem, simply put, is that the struct phy that the consumer has > will be different from the second struct phy that the provider registers > when it rebinds. Using the stale struct phy, we cannot reach the phy_ops > of the new provider. > > If we implement something similar to what Bartosz Golaszewski suggested > here: > https://lpc.events/event/17/contributions/1627/ There is also "revocable" from Tzung-Bi Shih. > > aka decouple the struct phy given to the consumer from the struct phy > provided by the provider, and perform a PHY lookup during each > phy_init() / phy_power_up() / etc etc operation; > > we are kinda able to: > - match the consumer phy to the provider phy and forward the call to > phy_ops, if the provider is present (or rebound) > - return -ENODEV instead of calling into phy_ops, if the provider is not > present > > But there's still one big gap. > > PHY consumers are driving the PHY through a state machine when they do > phy_init -> phy_power_up() -> phy_set_mode() -> [ phy_configure()...]. > When the provider unbinds and then rebinds, that state is lost. But the > consumer has no idea. It knows it is in a state where it called > phy_power_up(), and next thing it knows, the power_count is 1 and it > must call phy_power_down(). > > The Generic PHY core cannot actually replay the entire state in a > meaningful way so as to make the provider reprobing completely > transparent (think of phy_calibrate() calls). > > So I guess we're looking at what Bartosz refers to as a notification > side-channel that the PHY provider is going away. > > At least, for drivers that care in a meaningful way. For drivers that > don't care about such complexity, there seems to be a simpler answer: > device_link_add(DL_FLAG_AUTOREMOVE_CONSUMER). Does DL_FLAG_AUTOREMOVE_CONSUMER result in forceful unbinding of the consumer when provider goes away? And if it happens does it happen in the right order? > > I can also answer why device links with "autoremove consumer" are not a > universal answer. This is because the consumer itself may be a > multi-port network switch (single device). If you unbind the Generic PHY > driver (retimer, SerDes PHY) from port 3, you don't want ports 0, 1 and > 2 to also disappear from the kernel. You need the more complex PHY > provider disappearance notification which does something more localized > (calls phylink_stop(), I don't know). Aren't the ports represented as individual devices with their own state and lifetime? > > > I can say right off the bat that this is too complicated for me to even > think about in more detail than that, at the moment. I would be quite > happy if it's possible to unbind the PHY driver without the possibility > to rebind it, as a first step. Yes, this is something that is not phy specific and I think should be solved at driver core (or at least driver core should provide subsystems tools to solve this). Thanks. -- Dmitry