Date: Mon, 27 May 2024 18:19:28 +0200
From: Miquel Raynal <miquel.raynal@bootlin.com>
To: Lucas Stach, Laurent Pinchart, Marek Vasut, Stephen Boyd,
    Michael Turquette, Chen-Yu Tsai, linux-clk@vger.kernel.org,
    Ulf Hansson, "Rafael J. Wysocki", Kevin Hilman,
    linux-pm@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, Fabio Estevam, Jacky Bai,
    Peng Fan, Shawn Guo, Shengjiu Wang, linux-imx@nxp.com, Ian Ray,
    Thomas Petazzoni, Hervé Codina, Luca Ceresoli
Subject: Clocks vs. Power Domains circular dependency
Message-ID: <20240527181928.4fc6b5f0@xps-13>
Organization: Bootlin

Hello all,

I would like to re-open the clocks vs. runtime PM/power domains topic
that is still problematic on i.MX8. I am mentioning a specific SoC
here, but the problem is hardware agnostic: any SoC with a
sufficiently advanced power scheme may suffer from the same issues.
When (mainline) Linux is running, we get lockdep warnings because of
circular dependencies between the clock and power domain subsystems,
as well as real deadlocks.
I don't have the full history, but here is an interesting attempt at
fixing this by Marek at the end of 2022:
https://lore.kernel.org/all/Y1cs++TV2GCuh4tS@pendragon.ideasonboard.com/

And here is an interesting discussion that happened on the mailing
list following a second attempt from Marek:
https://patchwork.kernel.org/project/linux-pm/patch/20221108013517.749665-1-marex@denx.de

Looking more closely at this problem, there is indeed a software
design issue. Both the clock and power domain subsystems are recursive
and interdependent, the power domain subsystem being sometimes
accessed through runtime PM calls.

[TLDR: I'm drafting something, but issue #4 at the bottom is blocking]

** THE ISSUE: FROM A CLOCK PERSPECTIVE **

Clock changes (not only clk_prepare(), but getting/setting rates,
re-parenting and so on) involve runtime resuming part of the clock
tree. E.g. the registers for accessing/configuring a clock might be in
a power domain that must be resumed first. Also, the clk subsystem is
mostly written using recursion, so for instance preparing a clock
works like this:

/*
 * The same applies to the other clk consumer calls, but instead of
 * going through the parents, other parts of the clk tree are accessed
 * (the children subtree, the new parent tree, or both). So the
 * prepare situation is one of the simplest in this regard, but the
 * following basically applies to most consumer operations.
 */
clk_prepare(clk) {
	clk_prepare(clk->parent);
	runtime_resume(clk->domain);
	clk->ops->prepare();
}

Now, if we add another level of complexity, accesses to the clock tree
are serialized by the clk_prepare mutex, and runtime PM calls are also
protected by a global genpd lock:

clk_prepare(clk) {
	mutex_lock(clk_prepare);
	clk_prepare(clk->parent); /* clk_prepare is reentrant, so this is "fine" */
	runtime_resume(clk->domain);
	clk->ops->prepare();
	mutex_unlock(clk_prepare);
}

runtime_resume(domain) {
	mutex_lock(genpd);
	domain->ops->resume();
	mutex_unlock(genpd);
}

If we list the locking events in order, we will:
- acquire the clk lock
- acquire the genPD lock
- release the genPD lock
- release the clk lock

** THE ISSUE: FROM A POWER DOMAIN PERSPECTIVE **

Now let's see the other side, from the power domain perspective. We
can imagine several situations where these locks are acquired in the
reverse order. From a hardware standpoint there is a prerequisite (met
in several places on the i.MX8): a power domain is fed by a clock,
hence enabling the power domain also implies enabling the feeding
clock.

#1: Probing/enabling peripherals involves attaching and powering up
their power domain. If the power domain is fed by a clock, the power
domain implementation will request clock changes.

#2: Suspending devices happens asynchronously in a worker. Suspending
a device implies turning off its power domain, and hence possibly
reaching some clock internals as well.
In these cases we will play with the locks in this order:
- acquire the genPD lock
- acquire the clk lock
- release the clk lock
- release the genPD lock

Here are two typical examples of power domains playing with upstream
clocks:
https://elixir.bootlin.com/linux/latest/source/drivers/pmdomain/imx/imx8mp-blk-ctrl.c#L538
https://elixir.bootlin.com/linux/latest/source/drivers/pmdomain/imx/gpcv2.c#L340

** CURRENT UNDERSTANDING **

Clock changes and runtime PM calls can happen at any moment, and if
your hardware is in a bad mood a deadlock *will* occur. I myself only
experienced deadlocks involving the clock worker disabling unused
clocks at the end of the boot, but chances are way too high that
systems will lock up at some point in production environments because
of this circular dependency between the clock and power domain locks.

It is possible to avoid the deadlock I experienced by disabling the
clk_disable_unused worker and keeping all clocks enabled, but we will
all agree that this is a very sub-optimal and extremely unsatisfying
solution for a SoC explicitly designed for optimized power
consumption.

There is no magic: we must disentangle the situation and prevent
either one lock ordering or the other from happening. So we should ask
ourselves: which lock order seems the most legitimate?

#1: Do clocks really depend on power domains?

-> Clock controllers may be part of a power domain, hence the power
   domain must be enabled in order to access the clock controller
   registers.
-> The same applies if a clock is generated by an external device
   (e.g. an audio codec) and this codec is accessible through a serial
   bus: the bus controller must be resumed in order to configure the
   codec, and the bus controller may be in its own power domain.

#2: Do power domains really depend on clocks?

-> Somehow, yes.
Most power domains are not explicitly "driven" by the upstream clocks
described in device trees, i.e. keeping these clocks gated does not
prevent the use of, nor the access to, the power domains per se.
However, the power domains may redistribute the clocks to the
peripherals inside and somehow become clock controllers themselves
(even though so far they are not flagged like that in any DTS). Hence
it appears legitimate that they might want to play with the clock API.

DTSI example:
https://elixir.bootlin.com/linux/latest/source/arch/arm64/boot/dts/freescale/imx8mp.dtsi#L1799

TRM extract:

	13.2.1 Overview
	The MEDIAMIX domain contains control and status registers known
	as Media Block Control (MEDIA BLK_CTRL). These registers have
	miscellaneous top-level controls for peripherals within the
	MEDIAMIX domain. Some of these controls includes clocking, [...]

	13.2.2: Clocks and Resets
	The Media BLK_CTRL contains clock distribution and gating
	controls, as well as reset handling to several of the MEDIAMIX
	peripherals.

-> One specificity of the i.MX8 is that many power domains are also
   described with a "clocks" property which lists clocks that are in
   fact directly connected to the peripherals inside the power domain.
   The reason for this inaccurate description is: toggling a power
   domain on or off seems to involve propagating reset signals, which
   only happens correctly if a few specific clocks feeding the
   targeted peripherals are actually running.
Then, as part of the on/off sequence, it is required that these clocks
get enabled (and then disabled) earlier than in their respective probe
functions, see:
- ae1d2add26a4 ("soc: imx: gpcv2: handle reset clocks")
- https://elixir.bootlin.com/linux/latest/source/arch/arm64/boot/dts/freescale/imx8mp.dtsi#L795
- https://elixir.bootlin.com/linux/latest/source/arch/arm64/boot/dts/freescale/imx8mp.dtsi#L857

It is hard to tell, without more details from NXP, how difficult it
would be to switch from the "wrong" #2 power domain DTS description to
something more accurate and still get the proper init sequence.
Although maybe doing that would not be very relevant, as there are
real clock dependencies from some power domains anyway, as expressed
in the SoC manual quoted above.

#1 however, despite being a physical constraint, seems also tightly
related to the software design: runtime PM calls were initially
introduced in the clock subsystem for two reasons:
- The clock is itself in an area of the SoC that needs to be powered
  on before accessing its registers.
- The above situation also applies to all the possible parents of the
  clock.

See: 9a34b45397e5 ("clk: Add support for runtime PM")
https://lore.kernel.org/all/1503302703-13801-2-git-send-email-m.szyprowski@samsung.com

The clock core being highly recursive, it was way easier (and also
probably cleaner) to include nested runtime PM calls like this. But in
practice, we could imagine moving the resume path outside of the clock
lock and avoiding "by software" the conditions under which the two
locks are nested.
Note: these suggestions have already been made by Ulf:
https://patchwork.kernel.org/project/linux-pm/patch/20221108013517.749665-1-marex@denx.de/#25092554

** APPROACH UNDER EVALUATION **

As hinted above, maybe we should clarify the conditions under which
RPM calls can be made in the clock subsystem, and instead of including
the RPM calls within the clock sub-calls, we could consider, on a
case-by-case basis:
- first listing what needs to be resumed for the whole clock operation
  to succeed,
- then runtime resuming the needed domains outside of the main
  subsystem lock,
- before continuing with the clock operations and eventually acquiring
  the main clk_prepare lock.

IOW, avoiding the runtime PM calls while the subsystem lock is held.

Let's imagine the following situation: clkA (in power domain A) is the
parent of clkB (in power domain B). Like above, let's try to picture
the call stack:

clk_prepare(clkB) {
	mutex_lock(clk_prepare);
	clkA = clkB->parent;
	clk_prepare(clkA) {
		mutex_lock(clk_prepare); /* fine, it's recursive */
		runtime_resume(clkA->domain); // BANG
		clkA->ops->prepare();
		mutex_unlock(clk_prepare); /* does nothing */
	}
	runtime_resume(clkB->domain); // BANG again
	clkB->ops->prepare();
	mutex_unlock(clk_prepare);
}

Now, let's imagine we know in advance what clocks are going to be
accessed and thus need to be runtime resumed. Maybe we could do that
upfront. The logic could become:

clk_prepare(clkB) {
+	list = list_useful_clocks_for_prepare(clkB);
+	runtime_resume(list);
	mutex_lock(clk_prepare);
	clkA = clkB->parent;
	clk_prepare(clkA) {
		mutex_lock(clk_prepare); /* fine, it's recursive */
-		runtime_resume(clkA->domain);
		clkA->ops->prepare();
		mutex_unlock(clk_prepare); /* does nothing */
	}
-	runtime_resume(clkB->domain);
	clkB->ops->prepare();
	mutex_unlock(clk_prepare);
}

In the above example, the "list_useful_clocks_for_prepare()" function
is a bit mysterious and has two aspects: first, it needs to guess what
clocks are needed; this is addressed in the next paragraph.
Second, we need to make sure the tree does not change until the lock
is acquired; this is addressed in the "#1: Clock tree changes" chapter
below.

So, regarding the list itself, any operation on any clock in the tree
involves the said clock plus one or several clock "lists" among the
following choices:

A- All its parents, through the clk_core->parent link.
   => Typically during prepare/unprepare operations.
B- All its children and sub-children, through all the sub
   clk_core->children hlists.
   => Typically during rate operations (get and set, see note below).
C- Parents and children of the new parent.
   => In case of re-parenting, explicitly asked for or as a result of
   a rate change.

Side note: the clock core does not always remember the rate of the
clocks that were already configured. TBH I did not understand why,
because it seems trivial to do. Instead, the core physically accesses
the clock controller, which involves a lot of runtime PM handling. In
practice this is not always the case, but some clock controller
drivers request to be "asked" to read the actual rate each time.
Problem is, they re-calculate the rate and then also propagate the
result across all the children, which involves massive amounts of
(IMHO) useless wake-ups. This honestly feels a bit sub-optimal,
without a better understanding of what it is for.

** DRAWBACKS OF THIS APPROACH **

#1: Clock tree changes

Of course there was a point in making the runtime resume calls inside
the clk_prepare lock: the lock protects against clk tree changes. But
somehow this is still manageable; we could:
- acquire the lock
- parse the list of relevant clocks, reference-count them and save
  them in a list
- release the lock
- runtime resume the clocks in the list
- acquire the lock again
- compare the actual tree with the list and retry the whole operation
  if something changed.

Estimated status: manageable.
#2: Re-parenting

When performing almost any clock operation, we already know in advance
what clocks in the tree will be impacted and can thus precisely guess
what needs to be runtime resumed beforehand. But this does not work
when re-parenting is involved, as the re-parenting opportunity is only
tried very deep in the call stack, far beyond acquiring the main lock.
One idea to work around the situation is to somehow save the new
parent handle, propagate a specific error up through the call stack
(typically -EAGAIN) to the point where we can process this error,
perform a new round of runtime resume calls over the new parent tree,
and retry the operation.

Estimated status: manageable.

#3: Debug dumps might get slightly out of sync

There are a couple of situations where all the clocks (or all the
orphan clocks, or the root clocks) will be "shown/dumped" for the
user. It is fine to show all the parents/children of a single clock
(typically all the children of one root clock), but it involves a lot
of code duplication to cover e.g. all the root clocks at the same
time. If we decide that this is okay, we can iterate over these root
clocks, releasing the lock between them, which is obviously racy.

Estimated status: probably manageable, maybe not so important.

#4: Clock core recursivity and unrelated clocks

This is the main drawback of this approach, and it is what motivated
this letter in the first place. The clock core is recursive. At first
I thought this was only useful to simplify the locking mechanism
inside the core itself, in particular because the core is mainly built
out of recursive functions, but there are two actual use cases that
are much harder to solve. They are mentioned in the commit introducing
the reentrance mechanism:

533ddeb1e86f ("clk: allow reentrant calls into the clk framework")

Basically, besides simplifying the locking mechanism in the core
itself:

A: Enabling a clock may require enabling a "totally unrelated"
peripheral first.
Typically, an I2C audio codec may need to be configured in order to
produce a clock. In the ->prepare() hook implementation, one can
expect some kind of i2c_xfer() call, which in turn will runtime resume
the I2C controller used to access the codec. So we end up acquiring
the power domain lock again with the clk_prepare lock taken.

B: There is one clock driver (tegra/clk-dfll.c) which performs pinctrl
changes inside clock operations. If the platform is runtime PM
enabled, the pin controller may require being runtime resumed, and
once again we might fall into the same recursivity issue.

I don't see any positive outcome in trying to guess or discover these
cases dynamically, and if we want to address them I believe some kind
of static declaration must be made by the clock controller drivers
(and trying to do that now will inevitably lead to regressions if we
miss a single case).

Estimated status: problematic.

** CONCLUSION AND QUESTIONS **

I have actually drafted the idea explained above. It needs more work
to get to a working state, as the whole clock core must be converted
to this logic in order for it to work. I believe this approach is
relevant, but it still has blind spots we need to figure out (#4
above), for which I would like to get some feedback. Maybe, however,
the approach will not be acknowledged and another direction will be
pointed out. In that case I would be pleased to discuss it and see
what we can do. In all cases, the i.MX8 issue is real and must be
fixed. If we don't do it properly, soon another power-oriented SoC
will show similar defects. Basically, any Linux-based operating system
running on hardware featuring such similarities is subject to deadlock
at any moment.

I hope this report will be precise enough and raise ideas to help
going forward and finally address, in a rather satisfying way, this
issue that has been open since (at least) 2022.

Thanks,
Miquèl

PS: There are maybe other ways.
Of course I had to pick the one that felt the most promising, even
though I initially considered a couple of other alternatives:

* Addressing the issue in the power domain subsystem? This seemed less
  relevant than in the clock core, because of the physical
  dependencies. But please convince me :-)

* I saw comments about changing the clk_prepare lock semantics, but we
  will always need to serialize accesses to the clock tree, and after
  digging a bit in the clock core I don't see any way out with this
  solution.

* Preventing power domains from gathering clocks? Also a possible
  approach maybe, even though the situations listed above are far from
  straightforward to address in this case.