Linux Power Management development
* Re: [PATCH RESEND v1] thermal: core: fix blocking in unregistering zone
From: Guenter Roeck @ 2026-04-08 15:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Jiajia Liu, Daniel Lezcano, Zhang Rui, Lukasz Luba, linux-pm,
	linux-kernel, Armin Wolf, linux-hwmon
In-Reply-To: <CAJZ5v0jfi_gXPVq9E2eJe_0MG4vVojyDo6=ABv4fNFK=Q_qpug@mail.gmail.com>

On 4/8/26 08:05, Rafael J. Wysocki wrote:
> On Sun, Apr 5, 2026 at 5:34 AM Guenter Roeck <linux@roeck-us.net> wrote:
>>
>> On 4/4/26 10:38, Rafael J. Wysocki wrote:
>>> On Sat, Apr 4, 2026 at 4:02 PM Guenter Roeck <linux@roeck-us.net> wrote:
>>>>
>>>> On 4/4/26 05:58, Rafael J. Wysocki wrote:
>>>>> On Fri, Apr 3, 2026 at 4:20 PM Guenter Roeck <linux@roeck-us.net> wrote:
>>>>>>
>>>>>> On 4/3/26 05:52, Rafael J. Wysocki wrote:
>>>>>> .[ ... ]
>>>>>>> It appears to work for me, but I'm not sure if having multiple hwmon class
>>>>>>> devices with the same value in the name attribute is fine.
>>>>>>
>>>>>> Like this ?
>>>>>>
>>>>>> $ cd /sys/class/hwmon
>>>>>> $ grep . */name
>>>>>> hwmon0/name:r8169_0_c00:00
>>>>>> hwmon1/name:nvme
>>>>>> hwmon2/name:nvme
>>>>>> hwmon3/name:nct6687
>>>>>> hwmon4/name:k10temp
>>>>>> hwmon5/name:spd5118
>>>>>> hwmon6/name:spd5118
>>>>>> hwmon7/name:spd5118
>>>>>> hwmon8/name:spd5118
>>>>>> hwmon9/name:mt7921_phy0
>>>>>
>>>>> Yes.
>>>>>
>>>>>> Names such as "r8169_0_c00:00" and "mt7921_phy0" are actually overkill
>>>>>> since the "sensors" command makes it
>>>>>>
>>>>>> r8169_0_c00:00-mdio-0
>>>>>> Adapter: MDIO adapter
>>>>>> temp1:        +36.0°C  (high = +120.0°C)
>>>>>>
>>>>>> mt7921_phy0-pci-0d00
>>>>>> Adapter: PCI adapter
>>>>>> temp1:        +30.0°C
>>>>>>
>>>>>> essentially duplicating the device index.
>>>>>
>>>>> Well, with the patch posted by me, the output of sensors from a test
>>>>> system looks like this:
>>>>>
>>>>> acpitz-acpi-0
>>>>> Adapter: ACPI interface
>>>>> temp1:        +16.8°C
>>>>>
>>>>> pch_cannonlake-virtual-0
>>>>> Adapter: Virtual device
>>>>> temp1:        +33.0°C
>>>>>
>>>>> acpitz-acpi-0
>>>>> Adapter: ACPI interface
>>>>> temp1:        +27.8°C
>>>>>
>>>>> (some further data excluded), which is kind of confusing (note the
>>>>> duplicate acpitz-acpi-0 entries with different values of temp1).
>>>>>
>>>>
>>>> Yes, agreed, that is confusing. I would have expected the second one
>>>> to be identified as "acpitz-acpi-1". Do they both have the same parent ?
>>>
>>> No, they don't.
>>>
>>> The parent of each of them is a thermal zone device and both parents
>>> have the same "type" value.
>>>
>>>>> That could be disambiguated by concatenating the thermal zone ID
>>>>> (possibly after a '_') to the name.  Or the "temp*" things for thermal
>>>>> zones of the same type could carry different numbers.
>>>>>
>>>>> A less attractive alternative would be to register a special virtual
>>>>> device serving as a parent for all hwmon interfaces registered
>>>>> automatically for thermal zones.
>>>>
>>>> If they all have the same parent, technically it should be a single
>>>> hwmon device with multiple sensors, as in:
>>>>
>>>> acpitz-acpi-0
>>>> Adapter: ACPI interface
>>>> temp1:        +16.8°C
>>>> temp2:        +27.8°C
>>>
>>> So somebody tried to make it look like that by registering hwmon
>>> interfaces for all of the thermal zones of the same type under one of
>>> them, but that (quite obviously) doesn't work.
>>
>> Not sure I understand why that doesn't work or why that is obvious,
>> but I'll take you at your word (I would agree that the current
>> _implementation_ looks problematic).
> 
> For example, say that there are two ACPI thermal zones on a system
> 
> /sys/devices/virtual/thermal/thermal_zone0/
> /sys/devices/virtual/thermal/thermal_zone1/
> 
> The current mainline code registers a hwmon class device for thermal_zone0 only:
> 
> /sys/devices/virtual/thermal/thermal_zone0/hwmon0/
> 
> because the type is "acpitz" for both of them, but it adds a sysfs
> attribute that belongs to thermal_zone1 under it:
> 
> /sys/devices/virtual/thermal/thermal_zone0/hwmon0/temp2_input
> 
> There is also
> 
> /sys/devices/virtual/thermal/thermal_zone0/hwmon0/temp1_input
> 
> but it belongs to thermal_zone0.
> 
> Interesting things happen when thermal_zone0 is removed, for example
> because the ACPI thermal driver is unbound from the underlying
> platform device.  Namely, the removal code skips the removal of hwmon0
> because of the temp2_input attribute belonging to thermal_zone1 which
> effectively prevents thermal_zone0 removal from making progress.
> 
> AFAICS, nothing particularly smart can be done to address this issue
> while retaining the current design of the code.  Reparenting hwmon0 to
> thermal_zone1 may confuse user space, and so may removing hwmon0 along
> with temp2_input.  That's why I think that this is a design issue.
> 

The ACPI power meter driver has pretty much the same problem. A clear
solution would require making hwmon sysfs attributes dynamic in nature
(i.e., by adding the ability to change the visibility of attributes at
runtime). I have started working on that, but have not had time to
complete the work. The ACPI power meter driver works around that with a
kludge: it unregisters the hwmon device whenever it gets a
METER_NOTIFY_CONFIG event and then re-registers it.

Anyway, registering separate hwmon devices, one per thermal zone,
is perfectly fine with me.

Guenter
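
The naming option raised in this thread (one hwmon device per thermal
zone, with the thermal zone ID appended to the name after a '_') could be
sketched roughly as follows. This is a userspace illustration only;
zone_hwmon_name() is a hypothetical helper and not part of the thermal
or hwmon code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build "<type>_<id>" (e.g. "acpitz_1") into buf.
 * Returns 0 on success, -1 if the result does not fit into len bytes.
 */
static int zone_hwmon_name(char *buf, size_t len, const char *type, int id)
{
	int n;

	assert(buf && type);
	n = snprintf(buf, len, "%s_%d", type, id);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With two zones of type "acpitz" this would yield "acpitz_0" and
"acpitz_1", so the two entries in the "sensors" output would no longer
carry the same chip name.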



* [PATCH v2] interconnect: imx: fix use-after-free in imx_icc_node_init_qos()
From: Wentao Liang @ 2026-04-08 15:30 UTC (permalink / raw)
  To: Georgi Djakov, Shawn Guo, Sascha Hauer
  Cc: Pengutronix Kernel Team, Fabio Estevam, Wentao Liang, linux-pm,
	imx, linux-arm-kernel, linux-kernel, stable

The function imx_icc_node_init_qos() manually manages the reference count
of struct device_node *dn using of_node_put(). However, some error paths
use dn after the put, leading to use-after-free. Convert to automatic
cleanup using __free(device_node) to ensure the reference is always
released when dn goes out of scope.

Fixes: f0d8048525d7 ("interconnect: Add imx core driver")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
Changes in v2:
- Use automatic cleanup to fix the problem.
---
 drivers/interconnect/imx/imx.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/interconnect/imx/imx.c b/drivers/interconnect/imx/imx.c
index 9511f80cf041..e5fcdcb88cfb 100644
--- a/drivers/interconnect/imx/imx.c
+++ b/drivers/interconnect/imx/imx.c
@@ -120,7 +120,8 @@ static int imx_icc_node_init_qos(struct icc_provider *provider,
 	struct imx_icc_node *node_data = node->data;
 	const struct imx_icc_node_adj_desc *adj = node_data->desc->adj;
 	struct device *dev = provider->dev;
-	struct device_node *dn = NULL;
+	struct device_node *__free(device_node) dn = of_parse_phandle(dev->of_node,
+			adj->phandle_name, 0);
 	struct platform_device *pdev;
 
 	if (adj->main_noc) {
@@ -128,7 +129,6 @@ static int imx_icc_node_init_qos(struct icc_provider *provider,
 		dev_dbg(dev, "icc node %s[%d] is main noc itself\n",
 			node->name, node->id);
 	} else {
-		dn = of_parse_phandle(dev->of_node, adj->phandle_name, 0);
 		if (!dn) {
 			dev_warn(dev, "Failed to parse %s\n",
 				 adj->phandle_name);
@@ -138,12 +138,10 @@ static int imx_icc_node_init_qos(struct icc_provider *provider,
 		if (!of_device_is_available(dn)) {
 			dev_warn(dev, "Missing property %s, skip scaling %s\n",
 				 adj->phandle_name, node->name);
-			of_node_put(dn);
 			return 0;
 		}
 
 		pdev = of_find_device_by_node(dn);
-		of_node_put(dn);
 		if (!pdev) {
 			dev_warn(dev, "node %s[%d] missing device for %pOF\n",
 				 node->name, node->id, dn);
-- 
2.34.1
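
For readers unfamiliar with scope-based cleanup, the idea behind
__free(device_node) can be sketched in userspace with the compiler's
cleanup attribute (GCC and Clang). Everything below (struct node,
node_get(), node_put(), __free_node) is a hypothetical stand-in, not the
kernel API; it only shows that the reference is dropped when the variable
goes out of scope, on every return path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device_node: just a reference count. */
struct node {
	int refcount;
};

static struct node *node_get(struct node *n)
{
	if (n)
		n->refcount++;
	return n;
}

static void node_put(struct node *n)
{
	if (n) {
		assert(n->refcount > 0);
		n->refcount--;
	}
}

/* Cleanup handler: receives a pointer to the variable leaving scope. */
static void node_put_cleanup(struct node **np)
{
	node_put(*np);
}

#define __free_node __attribute__((cleanup(node_put_cleanup)))

/*
 * Takes a reference and never calls node_put() by hand.
 * Returns 0 on success and -1 on the error paths.
 */
static int use_node(struct node *target, int fail)
{
	struct node *dn __free_node = node_get(target);

	if (!dn)
		return -1;
	if (fail)
		return -1;	/* dn is still valid here; the put runs on return */
	return 0;
}
```

Because the put runs only when dn leaves scope, an error path can still
use dn safely, which is the property the patch relies on.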



* Status of thermal support for i.MX93
From: Stefan Wahren @ 2026-04-08 15:28 UTC (permalink / raw)
  To: Jacky Bai, Alice Guo, Frank Li
  Cc: Fabio Estevam, imx@lists.linux.dev, Linux ARM,
	open list:GENERIC PM DOMAINS, Daniel Lezcano, Sascha Hauer

Hi,

AFAIK the thermal support for i.MX93 hasn't been mainlined yet. The last 
version I can find is here [1].

Are there any plans to finish this work?

Thanks

[1] https://lore.kernel.org/linux-arm-kernel/d9392dbc-806a-41df-8992-28c3d6132309@linaro.org/#t



* Re: [patch 01/12] clockevents: Prevent timer interrupt starvation
From: Thomas Gleixner @ 2026-04-08 15:18 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: LKML, Calvin Owens, Peter Zijlstra, Anna-Maria Behnsen,
	Frederic Weisbecker, Ingo Molnar, John Stultz, Stephen Boyd,
	Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
	Sebastian Reichel, linux-pm, Pablo Neira Ayuso, Florian Westphal,
	Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <20260408155353-42aeefa4-db66-48aa-ab07-0538a8cfdbf0@linutronix.de>

On Wed, Apr 08 2026 at 15:55, Thomas Weißschuh wrote:
> On Wed, Apr 08, 2026 at 02:41:20PM +0200, Thomas Weißschuh wrote:
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -369,7 +369,7 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, b
>         if (dev->next_event_forced)
>                 return 0;
>  
> -       if (dev->set_next_event(dev->min_delta_ticks, dev)) {
> +       if (dev->set_next_event(dev->min_delta_ns, dev)) {

That's wrong as the callback expects cycles (ticks) not nanoseconds.

I've just pushed out an updated version to tip timers/urgent which
addresses a potentially related issue. Delta patch below.

Thanks,

        tglx
---
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -324,6 +324,8 @@ int clockevents_program_event(struct clo
 		return dev->set_next_ktime(expires, dev);
 
 	delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
+	if (delta <= 0 && !force)
+		return -ETIME;
 
 	if (delta > (int64_t)dev->min_delta_ns) {
 		delta = min(delta, (int64_t) dev->max_delta_ns);
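
The unit mismatch pointed out above (ticks versus nanoseconds) and the
early -ETIME return can be illustrated with a small userspace model.
struct clkevt_model and both helpers are hypothetical simplifications;
only the (ns * mult) >> shift conversion mirrors what the clockevents
core does. In particular, this sketch simply clamps the delta to
[min, max], whereas the kernel handles the minimum-delta case separately.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct clock_event_device. */
struct clkevt_model {
	uint32_t mult;		/* ns to ticks multiplier */
	uint32_t shift;		/* ns to ticks shift */
	uint64_t min_delta_ns;
	uint64_t max_delta_ns;
};

/* Convert a nanosecond delta into device ticks. */
static uint64_t ns_to_ticks(const struct clkevt_model *dev, uint64_t ns)
{
	assert(dev->shift < 64);
	return (ns * dev->mult) >> dev->shift;
}

/*
 * Returns the tick count that would be handed to the device callback,
 * or -ETIME if the expiry is already in the past and not forced.
 */
static int64_t delta_to_ticks(const struct clkevt_model *dev,
			      int64_t delta_ns, int force)
{
	if (delta_ns <= 0 && !force)
		return -ETIME;
	if (delta_ns < (int64_t)dev->min_delta_ns)
		delta_ns = (int64_t)dev->min_delta_ns;
	if (delta_ns > (int64_t)dev->max_delta_ns)
		delta_ns = (int64_t)dev->max_delta_ns;
	return (int64_t)ns_to_ticks(dev, (uint64_t)delta_ns);
}
```

With mult = 1 and shift = 2 (one tick per 4 ns), a minimum delta of
4000 ns corresponds to 1000 ticks, so passing the nanosecond value to a
callback that expects ticks would program a delay four times too long.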


* Re: [PATCH RESEND v1] thermal: core: fix blocking in unregistering zone
From: Rafael J. Wysocki @ 2026-04-08 15:05 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Rafael J. Wysocki, Jiajia Liu, Daniel Lezcano, Zhang Rui,
	Lukasz Luba, linux-pm, linux-kernel, Armin Wolf, linux-hwmon
In-Reply-To: <e5638cf8-88ee-4b61-b032-6cf324b7c642@roeck-us.net>

On Sun, Apr 5, 2026 at 5:34 AM Guenter Roeck <linux@roeck-us.net> wrote:
>
> On 4/4/26 10:38, Rafael J. Wysocki wrote:
> > On Sat, Apr 4, 2026 at 4:02 PM Guenter Roeck <linux@roeck-us.net> wrote:
> >>
> >> On 4/4/26 05:58, Rafael J. Wysocki wrote:
> >>> On Fri, Apr 3, 2026 at 4:20 PM Guenter Roeck <linux@roeck-us.net> wrote:
> >>>>
> >>>> On 4/3/26 05:52, Rafael J. Wysocki wrote:
> >>>> .[ ... ]
> >>>>> It appears to work for me, but I'm not sure if having multiple hwmon class
> >>>>> devices with the same value in the name attribute is fine.
> >>>>
> >>>> Like this ?
> >>>>
> >>>> $ cd /sys/class/hwmon
> >>>> $ grep . */name
> >>>> hwmon0/name:r8169_0_c00:00
> >>>> hwmon1/name:nvme
> >>>> hwmon2/name:nvme
> >>>> hwmon3/name:nct6687
> >>>> hwmon4/name:k10temp
> >>>> hwmon5/name:spd5118
> >>>> hwmon6/name:spd5118
> >>>> hwmon7/name:spd5118
> >>>> hwmon8/name:spd5118
> >>>> hwmon9/name:mt7921_phy0
> >>>
> >>> Yes.
> >>>
> >>>> Names such as "r8169_0_c00:00" and "mt7921_phy0" are actually overkill
> >>>> since the "sensors" command makes it
> >>>>
> >>>> r8169_0_c00:00-mdio-0
> >>>> Adapter: MDIO adapter
> >>>> temp1:        +36.0°C  (high = +120.0°C)
> >>>>
> >>>> mt7921_phy0-pci-0d00
> >>>> Adapter: PCI adapter
> >>>> temp1:        +30.0°C
> >>>>
> >>>> essentially duplicating the device index.
> >>>
> >>> Well, with the patch posted by me, the output of sensors from a test
> >>> system looks like this:
> >>>
> >>> acpitz-acpi-0
> >>> Adapter: ACPI interface
> >>> temp1:        +16.8°C
> >>>
> >>> pch_cannonlake-virtual-0
> >>> Adapter: Virtual device
> >>> temp1:        +33.0°C
> >>>
> >>> acpitz-acpi-0
> >>> Adapter: ACPI interface
> >>> temp1:        +27.8°C
> >>>
> >>> (some further data excluded), which is kind of confusing (note the
> >>> duplicate acpitz-acpi-0 entries with different values of temp1).
> >>>
> >>
> >> Yes, agreed, that is confusing. I would have expected the second one
> >> to be identified as "acpitz-acpi-1". Do they both have the same parent ?
> >
> > No, they don't.
> >
> > The parent of each of them is a thermal zone device and both parents
> > have the same "type" value.
> >
> >>> That could be disambiguated by concatenating the thermal zone ID
> >>> (possibly after a '_') to the name.  Or the "temp*" things for thermal
> >>> zones of the same type could carry different numbers.
> >>>
> >>> A less attractive alternative would be to register a special virtual
> >>> device serving as a parent for all hwmon interfaces registered
> >>> automatically for thermal zones.
> >>
> >> If they all have the same parent, technically it should be a single
> >> hwmon device with multiple sensors, as in:
> >>
> >> acpitz-acpi-0
> >> Adapter: ACPI interface
> >> temp1:        +16.8°C
> >> temp2:        +27.8°C
> >
> > So somebody tried to make it look like that by registering hwmon
> > interfaces for all of the thermal zones of the same type under one of
> > them, but that (quite obviously) doesn't work.
>
> Not sure I understand why that doesn't work or why that is obvious,
> but I'll take you at your word (I would agree that the current
> _implementation_ looks problematic).

For example, say that there are two ACPI thermal zones on a system

/sys/devices/virtual/thermal/thermal_zone0/
/sys/devices/virtual/thermal/thermal_zone1/

The current mainline code registers a hwmon class device for thermal_zone0 only:

/sys/devices/virtual/thermal/thermal_zone0/hwmon0/

because the type is "acpitz" for both of them, but it adds a sysfs
attribute that belongs to thermal_zone1 under it:

/sys/devices/virtual/thermal/thermal_zone0/hwmon0/temp2_input

There is also

/sys/devices/virtual/thermal/thermal_zone0/hwmon0/temp1_input

but it belongs to thermal_zone0.

Interesting things happen when thermal_zone0 is removed, for example
because the ACPI thermal driver is unbound from the underlying
platform device.  Namely, the removal code skips the removal of hwmon0
because of the temp2_input attribute belonging to thermal_zone1 which
effectively prevents thermal_zone0 removal from making progress.

AFAICS, nothing particularly smart can be done to address this issue
while retaining the current design of the code.  Reparenting hwmon0 to
thermal_zone1 may confuse user space, and so may removing hwmon0 along
with temp2_input.  That's why I think that this is a design issue.
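
The removal problem described above can be pictured with a toy userspace
model. It is not the thermal core code: struct shared_hwmon and
zone_remove_hwmon() are hypothetical names, and the only assumption
carried over from the description is that the shared hwmon device stays
registered for as long as any zone still owns an attribute under it.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ZONES 4

/* Hypothetical model: one hwmon device shared by all zones of one type. */
struct shared_hwmon {
	int parent_zone;		/* zone the hwmon device lives under */
	bool attr_present[MAX_ZONES];	/* temp<N+1>_input owned by zone N */
	bool registered;
};

static int attrs_left(const struct shared_hwmon *h)
{
	int i, n = 0;

	for (i = 0; i < MAX_ZONES; i++)
		n += h->attr_present[i];
	return n;
}

/*
 * Remove the attribute owned by @zone. The hwmon device itself is only
 * unregistered once no attributes remain. Returns true if that happened.
 */
static bool zone_remove_hwmon(struct shared_hwmon *h, int zone)
{
	assert(zone >= 0 && zone < MAX_ZONES);
	h->attr_present[zone] = false;
	if (attrs_left(h) > 0)
		return false;	/* device stays, even under a departing zone */
	h->registered = false;
	return true;
}
```

In this model, removing zone 0 first leaves the hwmon device registered
under zone 0 because zone 1 still owns temp2_input, which is the
situation that keeps thermal_zone0 removal from completing.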

> I looked into the source code of the "sensors" command. It indeed does
> not index ACPI devices (nor virtual devices, for that matter) but
> assumes that such devices are unique. My apologies for not realizing
> this earlier.
>
> So your only option is indeed to index the chip name if you want/need
> more than one hwmon device with the same base name (here: acpitz)
> instantiated from the thermal subsystem.
>
> One comment to one of your earlier e-mails:
>
> "However, it is more of a design issue IMV because putting temperature
>   attributes for all of the (possibly unrelated) thermal zones of the
>   same type under one hwmon interface is not particularly useful"
>
> A single hardware monitoring device, by design, serves multiple
> thermal zones. Anything else would not make sense for multi-channel
> hardware monitoring chips. The hardware monitoring subsystem groups
> sensors by chip, not by thermal zones.
>
> Having said this: This discussion is not new. Certain subsystems
> advocate for having one hardware monitoring device per sensor,
> not per chip. One submitter went as far as telling me that I am
> clueless. We don't need to repeat the exercise. I have my opinion,
> you have yours, and all we can do is to agree to disagree.

I'm not sure if this has anything to do with hardware monitoring chips
because hwmon_device_register_for_thermal() sets the chip argument of
__hwmon_device_register() to NULL, so the chip information is missing
in this particular case.  The underlying hardware may or may not be a
multi-channel hardware monitoring chip, that is hard to tell in
general.

In the particular case of ACPI thermal zones, they each correspond to
a different platform device and regarding those as different channels
of the same hardware monitoring chip is kind of a stretch IMV (they
may even be located at different places in the device hierarchy).

Regardless, it should be possible to remove each of them cleanly
because they are handled by the driver independently.


* Re: [PATCH v2 0/7] thermal: samsung: Add support for Google GS101 TMU
From: Alexey Klimov @ 2026-04-08 14:49 UTC (permalink / raw)
  To: Tudor Ambarus
  Cc: Rafael J. Wysocki, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Krzysztof Kozlowski, Alim Akhtar, Bartlomiej Zolnierkiewicz,
	Kees Cook, Gustavo A. R. Silva, Peter Griffin, André Draszik,
	willmcvicker, jyescas, shin.son, linux-samsung-soc, linux-kernel,
	linux-pm, devicetree, linux-arm-kernel, linux-hardening
In-Reply-To: <20260119-acpm-tmu-v2-0-e02a834f04c6@linaro.org>

On Mon Jan 19, 2026 at 12:08 PM GMT, Tudor Ambarus wrote:
> Add support for the Thermal Management Unit (TMU) on the Google GS101
> SoC.
>
> The GS101 TMU implementation utilizes a hybrid architecture where
> management is shared between the kernel and the Alive Clock and
> Power Manager (ACPM) firmware.

Do you plan to update or work on this series? If, for some reason,
this series is postponed, I can rebase and re-send it, for example.
IIRC it needs a clean rebase as a minimal change.

I am building some code on top of it, so it would be nice to have a
newer version that can be (re-)tested for Exynos850.

Thanks,
Alexey

[...]


* [PATCH v2] cpufreq: Fix hotplug-suspend race during reboot
From: Tianxiang Chen @ 2026-04-08 14:19 UTC (permalink / raw)
  To: rafael; +Cc: viresh.kumar, lingyue, linux-pm, linux-kernel, Tianxiang Chen
In-Reply-To: <CAJZ5v0ie54h2aK05qNZTWNw5bu7GZDgsxM55KSsuF=ReLMkm-w@mail.gmail.com>

During system reboot, cpufreq_suspend() is called via the
kernel_restart() -> device_shutdown() -> pm_notifier_call_chain()
path. Unlike the normal system suspend path, the reboot path does not
call freeze_processes(), so userspace processes and kernel threads
remain active.

This allows CPU hotplug operations to run concurrently with
cpufreq_suspend(). The original code has no synchronization with CPU
hotplug, leading to a race condition where governor_data can be freed
by the hotplug path while cpufreq_suspend() is still accessing it,
resulting in a null pointer dereference:

  Unable to handle kernel NULL pointer dereference
  Call Trace:
   do_kernel_fault+0x28/0x3c
   cpufreq_suspend+0xdc/0x160
   device_shutdown+0x18/0x200
   kernel_restart+0x40/0x80
   arm64_sys_reboot+0x1b0/0x200

Fix this by adding cpus_read_lock()/cpus_read_unlock() to
cpufreq_suspend() to block CPU hotplug operations while suspend is in
progress.

Signed-off-by: Tianxiang Chen <nanmu@xiaomi.com>
---
v2:
- Update changelog to explicitly mention reboot scenario
- Add observed crash trace
---
 drivers/cpufreq/cpufreq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 1f794524a1d9..6f1d264c378b 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1979,6 +1979,7 @@ void cpufreq_suspend(void)
 	if (!cpufreq_driver)
 		return;
 
+	cpus_read_lock();
 	if (!has_target() && !cpufreq_driver->suspend)
 		goto suspend;
 
@@ -1998,6 +1999,7 @@ void cpufreq_suspend(void)
 
 suspend:
 	cpufreq_suspended = true;
+	cpus_read_unlock();
 }
 
 /**
-- 
2.34.1



* Re: [patch 01/12] clockevents: Prevent timer interrupt starvation
From: Frederic Weisbecker @ 2026-04-08 14:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Calvin Owens, Peter Zijlstra, Anna-Maria Behnsen,
	Ingo Molnar, John Stultz, Stephen Boyd, Alexander Viro,
	Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
	linux-pm, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	netfilter-devel, coreteam
In-Reply-To: <20260407083247.562657657@kernel.org>

Le Tue, Apr 07, 2026 at 10:54:17AM +0200, Thomas Gleixner a écrit :
> From: Thomas Gleixner <tglx@kernel.org>
> 
> Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
> up in user space. He provided a reproducer, which sets up a timerfd based
> timer and then rearms it in a loop with an absolute expiry time of 1ns.
> 
> As the expiry time is in the past, the timer ends up as the first expiring
> timer in the per CPU hrtimer base and the clockevent device is programmed
> with the minimum delta value. If the machine is fast enough, this ends up
> in an endless loop of programming the delta value to the minimum value
> defined by the clock event device, before the timer interrupt can fire,
> which starves the interrupt and consequently triggers the lockup detector
> because the hrtimer callback of the lockup mechanism is never invoked.
> 
> As a first step to prevent this, avoid reprogramming the clock event device
> when:
>      - a forced minimum delta event is pending
>      - the new expiry delta is less than or equal to the minimum delta
> 
> Thanks to Calvin for providing the reproducer and to Borislav for testing
> and providing data from his Zen5 machine.
> 
> The problem is not limited to Zen5, but depending on the underlying
> clock event device (e.g. TSC deadline timer on Intel) and the CPU speed,
> it is not necessarily observable.
> 
> This change serves only as the last resort and further changes will be made
> to prevent this scenario earlier in the call chain as far as possible.
> 
> Fixes: d316c57ff6bf ("[PATCH] clockevents: add core functionality")
> Reported-by: Calvin Owens <calvin@wbinvd.org>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Link: https://lore.kernel.org/lkml/acMe-QZUel-bBYUh@mozart.vkv.me/

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

-- 
Frederic Weisbecker
SUSE Labs


* [PATCH] pmdomain: mediatek: fix use-after-free in scpsys_get_bus_protection_legacy()
From: Wentao Liang @ 2026-04-08 14:11 UTC (permalink / raw)
  To: Ulf Hansson, Matthias Brugger, AngeloGioacchino Del Regno
  Cc: nfraprado, Macpaul Lin, Adam Ford, Chen-Yu Tsai, linux-pm,
	linux-kernel, linux-arm-kernel, linux-mediatek, Wentao Liang,
	stable

In scpsys_get_bus_protection_legacy(), of_find_node_with_property()
returns a device node with its reference count incremented. The function
then calls of_node_put(node) before checking whether
syscon_regmap_lookup_by_phandle() returns an error. If an error occurs,
dev_err_probe() dereferences the node pointer to print diagnostic
information, but the node memory may have already been freed due to the
earlier of_node_put(), leading to a use-after-free vulnerability.

Fix this by moving the of_node_put() call after the error check, ensuring
the node is still valid when accessed in the error path.

Fixes: c29345fa5f66 ("pmdomain: mediatek: Refactor bus protection regmaps retrieval")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
 drivers/pmdomain/mediatek/mtk-pm-domains.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/pmdomain/mediatek/mtk-pm-domains.c b/drivers/pmdomain/mediatek/mtk-pm-domains.c
index e2800aa1bc59..d3b36f32417c 100644
--- a/drivers/pmdomain/mediatek/mtk-pm-domains.c
+++ b/drivers/pmdomain/mediatek/mtk-pm-domains.c
@@ -993,6 +993,7 @@ static int scpsys_get_bus_protection_legacy(struct device *dev, struct scpsys *s
 	struct device_node *node, *smi_np;
 	int num_regmaps = 0, i, j;
 	struct regmap *regmap[3];
+	int ret = 0;
 
 	/*
 	 * Legacy code retrieves a maximum of three bus protection handles:
@@ -1043,11 +1044,14 @@ static int scpsys_get_bus_protection_legacy(struct device *dev, struct scpsys *s
 	if (node) {
 		regmap[2] = syscon_regmap_lookup_by_phandle(node, "mediatek,infracfg-nao");
 		num_regmaps++;
-		of_node_put(node);
-		if (IS_ERR(regmap[2]))
-			return dev_err_probe(dev, PTR_ERR(regmap[2]),
+		if (IS_ERR(regmap[2])) {
+			ret = dev_err_probe(dev, PTR_ERR(regmap[2]),
 					     "%pOF: failed to get infracfg regmap\n",
 					     node);
+			of_node_put(node);
+			return ret;
+		}
+		of_node_put(node);
 	} else {
 		regmap[2] = NULL;
 	}
-- 
2.34.1
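
The ordering issue fixed here can be shown with a small userspace model
in which the "freed" state is only a flag, so the bad access can be
observed without actually touching freed memory. All names below are
hypothetical stand-ins; report_error() plays the role of the error
message that dereferences the node.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted device node. */
struct node_model {
	int refcount;
	bool freed;	/* set when the last reference is dropped */
};

static void node_model_put(struct node_model *n)
{
	assert(n->refcount > 0);
	if (--n->refcount == 0)
		n->freed = true;
}

/* Stand-in for an error report that dereferences the node. */
static bool report_error(const struct node_model *n)
{
	return !n->freed;	/* true if the access was safe */
}

/* Original ordering: put first, then use the node in the error path. */
static bool lookup_buggy(struct node_model *n, bool lookup_fails)
{
	node_model_put(n);
	if (lookup_fails)
		return report_error(n);
	return true;
}

/* Fixed ordering: report first, drop the reference last. */
static bool lookup_fixed(struct node_model *n, bool lookup_fails)
{
	bool safe = true;

	if (lookup_fails)
		safe = report_error(n);
	node_model_put(n);
	return safe;
}
```

Both variants drop exactly one reference; they differ only in whether
the node is still held when the error path reads it.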



* Re: [patch 01/12] clockevents: Prevent timer interrupt starvation
From: Thomas Weißschuh @ 2026-04-08 13:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Calvin Owens, Peter Zijlstra, Anna-Maria Behnsen,
	Frederic Weisbecker, Ingo Molnar, John Stultz, Stephen Boyd,
	Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
	Sebastian Reichel, linux-pm, Pablo Neira Ayuso, Florian Westphal,
	Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <20260408143313-ac6c3b82-70e6-4ce3-b33a-20f5e6ba160b@linutronix.de>

On Wed, Apr 08, 2026 at 02:41:20PM +0200, Thomas Weißschuh wrote:
> Hi Thomas,
> 
> On Tue, Apr 07, 2026 at 10:54:17AM +0200, Thomas Gleixner wrote:
> > From: Thomas Gleixner <tglx@kernel.org>
> > 
> > Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
> > up in user space. He provided a reproducer, which sets up a timerfd based
> > timer and then rearms it in a loop with an absolute expiry time of 1ns.
> > 
> > As the expiry time is in the past, the timer ends up as the first expiring
> > timer in the per CPU hrtimer base and the clockevent device is programmed
> > with the minimum delta value. If the machine is fast enough, this ends up
> > in an endless loop of programming the delta value to the minimum value
> > defined by the clock event device, before the timer interrupt can fire,
> > which starves the interrupt and consequently triggers the lockup detector
> > because the hrtimer callback of the lockup mechanism is never invoked.
> > 
> > As a first step to prevent this, avoid reprogramming the clock event device
> > when:
> >      - a forced minimum delta event is pending
> >      - the new expiry delta is less than or equal to the minimum delta
> 
> with this patch now in the tip tree my QEMU/virtme-ng based machine
> fails to boot. The startup seems to freeze in:
> start_kernel() -> rest_init() -> schedule_preempt_disabled() -> schedule()
> 
> CONFIG_GENERIC_CLOCKEVENTS=y
> CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
> CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE=y
> CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
> CONFIG_HZ=1000
> 
> CPU: i5-1135G7
> clock event device: lapic-deadline
> 
> The clockevent device is still reprogrammed each millisecond,
> presumably for the tick.
> 
> (...)

This fixes the issue for me:

--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -369,7 +369,7 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, b
        if (dev->next_event_forced)
                return 0;
 
-       if (dev->set_next_event(dev->min_delta_ticks, dev)) {
+       if (dev->set_next_event(dev->min_delta_ns, dev)) {
                if (!force || clockevents_program_min_delta(dev))
                        return -ETIME;
        }


Thomas


* Re: [patch 01/12] clockevents: Prevent timer interrupt starvation
From: Thomas Weißschuh @ 2026-04-08 12:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Calvin Owens, Peter Zijlstra, Anna-Maria Behnsen,
	Frederic Weisbecker, Ingo Molnar, John Stultz, Stephen Boyd,
	Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
	Sebastian Reichel, linux-pm, Pablo Neira Ayuso, Florian Westphal,
	Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <20260407083247.562657657@kernel.org>

Hi Thomas,

On Tue, Apr 07, 2026 at 10:54:17AM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@kernel.org>
> 
> Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
> up in user space. He provided a reproducer, which sets up a timerfd based
> timer and then rearms it in a loop with an absolute expiry time of 1ns.
> 
> As the expiry time is in the past, the timer ends up as the first expiring
> timer in the per CPU hrtimer base and the clockevent device is programmed
> with the minimum delta value. If the machine is fast enough, this ends up
> in an endless loop of programming the delta value to the minimum value
> defined by the clock event device, before the timer interrupt can fire,
> which starves the interrupt and consequently triggers the lockup detector
> because the hrtimer callback of the lockup mechanism is never invoked.
> 
> As a first step to prevent this, avoid reprogramming the clock event device
> when:
>      - a forced minimum delta event is pending
>      - the new expiry delta is less than or equal to the minimum delta

with this patch now in the tip tree my QEMU/virtme-ng based machine
fails to boot. The startup seems to freeze in:
start_kernel() -> rest_init() -> schedule_preempt_disabled() -> schedule()

CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_HZ=1000

CPU: i5-1135G7
clock event device: lapic-deadline

The clockevent device is still reprogrammed each millisecond,
presumably for the tick.

(...)


Thomas


* [PATCH v11 11/14] sched: add need-resched timed wait interface
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora, Ingo Molnar
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

Add tif_bitset_relaxed_wait() (and tif_need_resched_relaxed_wait()
which wraps it) which takes the thread_info bit and timeout duration
as parameters and waits until the bit is set or for the expiration
of the timeout.

The wait is implemented via smp_cond_load_relaxed_timeout().

smp_cond_load_relaxed_timeout() essentially provides the pattern used
in poll_idle() where we spin in a loop waiting for the flag to change
until a timeout occurs.

tif_need_resched_relaxed_wait() allows us to abstract out the internals
of waiting, scheduler specific details etc.

Placed in linux/sched/idle.h instead of linux/thread_info.h to work
around recursive include hell.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-pm@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched/idle.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 8465ff1f20d1..ddee9b019895 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -3,6 +3,7 @@
 #define _LINUX_SCHED_IDLE_H
 
 #include <linux/sched.h>
+#include <linux/sched/clock.h>
 
 enum cpu_idle_type {
 	__CPU_NOT_IDLE = 0,
@@ -113,4 +114,32 @@ static __always_inline void current_clr_polling(void)
 }
 #endif
 
+/*
+ * Caller needs to make sure that the thread context cannot be preempted
+ * or migrated, so current_thread_info() cannot change from under us.
+ *
+ * This also allows us to safely stay in the local_clock domain.
+ */
+static __always_inline bool tif_bitset_relaxed_wait(int tif, u64 timeout_ns)
+{
+	unsigned long flags;
+
+	flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
+					      (VAL & BIT(tif)),
+					      local_clock_noinstr(),
+					      timeout_ns);
+	return flags & BIT(tif);
+}
+
+/**
+ * tif_need_resched_relaxed_wait() - Wait for need-resched being set
+ * with no ordering guarantees until a timeout expires.
+ *
+ * @timeout_ns: timeout value.
+ */
+static __always_inline bool tif_need_resched_relaxed_wait(u64 timeout_ns)
+{
+	return tif_bitset_relaxed_wait(TIF_NEED_RESCHED, timeout_ns);
+}
+
 #endif /* _LINUX_SCHED_IDLE_H */
-- 
2.31.1



* [PATCH v11 09/14] bpf/rqspinlock: switch check_timeout() to a clock interface
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

check_timeout() gets the current time value and, depending on how
much time has passed, checks for deadlock or times out. It returns 0
on success, or -errno on deadlock or timeout.

Switch this out to a clock style interface, where it functions as a
clock in the "lock-domain", returning the current time until a
deadlock or timeout occurs. Once a deadlock or timeout has occurred,
it stops functioning as a clock and returns error.

Also adjust the RES_CHECK_TIMEOUT macro to discard the clock value
when updating the explicit return status.
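
The clock-style return convention can be sketched in isolation. This is
a simplified userspace model under stated assumptions, not the
rqspinlock code: struct demo_timeout, demo_clock_deadlock() and
demo_check_timeout() are hypothetical names, the current time is passed
in by the caller rather than read from a clock, and deadlock detection
is reduced to a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical lock-domain clock state; field names are illustrative. */
struct demo_timeout {
	int64_t timeout_end;	/* absolute deadline in ns */
	int deadlocked;		/* set by some external deadlock detector */
};

/*
 * Clock-style check: returns the (positive) current time while waiting
 * may continue, or a negative errno once waiting must stop.
 */
static int64_t demo_clock_deadlock(const struct demo_timeout *ts, int64_t now)
{
	assert(now > 0);
	if (ts->deadlocked)
		return -EDEADLK;
	if (now > ts->timeout_end)
		return -ETIMEDOUT;
	return now;
}

/*
 * Split the combined value: the status receives only the error (or 0),
 * while the caller still sees the clock value or the error.
 */
static int64_t demo_check_timeout(const struct demo_timeout *ts, int64_t now,
				  int *ret)
{
	int64_t timeval_err = demo_clock_deadlock(ts, now);

	*ret = timeval_err < 0 ? (int)timeval_err : 0;
	return timeval_err;
}
```

Callers that only care about failure compare the result against zero
(`< 0`), which is why the call sites in this patch gain that comparison.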

Cc: bpf@vger.kernel.org
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/bpf/rqspinlock.c | 45 +++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/rqspinlock.c b/kernel/bpf/rqspinlock.c
index e4e338cdb437..0ec17ebb67c1 100644
--- a/kernel/bpf/rqspinlock.c
+++ b/kernel/bpf/rqspinlock.c
@@ -196,8 +196,12 @@ static noinline int check_deadlock_ABBA(rqspinlock_t *lock, u32 mask)
 	return 0;
 }
 
-static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
-				  struct rqspinlock_timeout *ts)
+/*
+ * Returns current monotonic time in ns on success, or a negative errno
+ * value on failure due to timeout expiration or detection of deadlock.
+ */
+static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
+				   struct rqspinlock_timeout *ts)
 {
 	u64 prev = ts->cur;
 	u64 time;
@@ -207,7 +211,7 @@ static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
 			return -EDEADLK;
 		ts->cur = ktime_get_mono_fast_ns();
 		ts->timeout_end = ts->cur + ts->duration;
-		return 0;
+		return (s64)ts->cur;
 	}
 
 	time = ktime_get_mono_fast_ns();
@@ -219,11 +223,15 @@ static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
 	 * checks.
 	 */
 	if (prev + NSEC_PER_MSEC < time) {
+		int ret;
 		ts->cur = time;
-		return check_deadlock_ABBA(lock, mask);
+		ret = check_deadlock_ABBA(lock, mask);
+		if (ret)
+			return ret;
+
 	}
 
-	return 0;
+	return (s64)time;
 }
 
 /*
@@ -231,15 +239,22 @@ static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
  * as the macro does internal amortization for us.
  */
 #ifndef res_smp_cond_load_acquire
-#define RES_CHECK_TIMEOUT(ts, ret, mask)                              \
-	({                                                            \
-		if (!(ts).spin++)                                     \
-			(ret) = check_timeout((lock), (mask), &(ts)); \
-		(ret);                                                \
+#define RES_CHECK_TIMEOUT(ts, ret, mask)					\
+	({									\
+		s64 __timeval_err = 0;						\
+		if (!(ts).spin++)						\
+			__timeval_err = clock_deadlock((lock), (mask), &(ts));	\
+		(ret) = __timeval_err < 0 ? __timeval_err : 0;			\
+		__timeval_err;							\
 	})
 #else
-#define RES_CHECK_TIMEOUT(ts, ret, mask)			      \
-	({ (ret) = check_timeout((lock), (mask), &(ts)); })
+#define RES_CHECK_TIMEOUT(ts, ret, mask)					\
+	({									\
+		s64 __timeval_err;						\
+		__timeval_err = clock_deadlock((lock), (mask), &(ts));		\
+		(ret) = __timeval_err < 0 ? __timeval_err : 0;			\
+		__timeval_err;							\
+	})
 #endif
 
 /*
@@ -281,7 +296,7 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock)
 	val = atomic_read(&lock->val);
 
 	if (val || !atomic_try_cmpxchg(&lock->val, &val, 1)) {
-		if (RES_CHECK_TIMEOUT(ts, ret, ~0u))
+		if (RES_CHECK_TIMEOUT(ts, ret, ~0u) < 0)
 			goto out;
 		cpu_relax();
 		goto retry;
@@ -406,7 +421,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 */
 	if (val & _Q_LOCKED_MASK) {
 		RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
-		res_smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK));
+		res_smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK) < 0);
 	}
 
 	if (ret) {
@@ -568,7 +583,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 */
 	RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT * 2);
 	val = res_atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
-					   RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK));
+					   RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK) < 0);
 
 	/* Disable queue destruction when we detect deadlocks. */
 	if (ret == -EDEADLK) {
-- 
2.31.1



* [PATCH v11 02/14] arm64: barrier: Support smp_cond_load_relaxed_timeout()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

Support waiting in smp_cond_load_relaxed_timeout() via
__cmpwait_relaxed(). To ensure that we wake from waiting in
WFE periodically and don't block forever if there are no stores
to ptr, this path is only used when the event-stream is enabled.

Note that when using __cmpwait_relaxed() we ignore the timeout
value, allowing an overshoot by up to the event-stream period.
And, in the unlikely event that the event-stream is unavailable,
fall back to spin-waiting.

Also set SMP_TIMEOUT_POLL_COUNT to 1 so we do the time-check in
each iteration of smp_cond_load_relaxed_timeout().

And finally define ARCH_HAS_CPU_RELAX to indicate that we have
an optimized implementation of cpu_poll_relax().

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Suggested-by: Will Deacon <will@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/Kconfig               |  3 +++
 arch/arm64/include/asm/barrier.h | 21 +++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9ea19b74b6c3..e3ce08276e9b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1628,6 +1628,9 @@ config ARCH_SUPPORTS_CRASH_DUMP
 config ARCH_DEFAULT_CRASH_DUMP
 	def_bool y
 
+config ARCH_HAS_CPU_RELAX
+	def_bool y
+
 config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
 	def_bool CRASH_RESERVE
 
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 9495c4441a46..6190e178db51 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -12,6 +12,7 @@
 #include <linux/kasan-checks.h>
 
 #include <asm/alternative-macros.h>
+#include <asm/vdso/processor.h>
 
 #define __nops(n)	".rept	" #n "\nnop\n.endr\n"
 #define nops(n)		asm volatile(__nops(n))
@@ -219,6 +220,26 @@ do {									\
 	(typeof(*ptr))VAL;						\
 })
 
+/* Re-declared here to avoid include dependency. */
+extern bool arch_timer_evtstrm_available(void);
+
+/*
+ * In the common case, cpu_poll_relax() sits waiting in __cmpwait_relaxed()
+ * for the ptr value to change.
+ *
+ * Since this period is reasonably long, choose SMP_TIMEOUT_POLL_COUNT
+ * to be 1, so smp_cond_load_{relaxed,acquire}_timeout() does a
+ * time-check in each iteration.
+ */
+#define SMP_TIMEOUT_POLL_COUNT	1
+
+#define cpu_poll_relax(ptr, val, timeout_ns) do {			\
+	if (arch_timer_evtstrm_available())				\
+		__cmpwait_relaxed(ptr, val);				\
+	else								\
+		cpu_relax();						\
+} while (0)
+
 #include <asm-generic/barrier.h>
 
 #endif	/* __ASSEMBLER__ */
-- 
2.31.1



* [PATCH v11 14/14] kunit: add tests for smp_cond_load_relaxed_timeout()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

Add a success and failure case for smp_cond_load_relaxed_timeout().

Both test cases wait on some state in smp_cond_load_relaxed_timeout().
In the success case we spawn a kthread that pokes the bit.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 lib/Kconfig.debug                |  10 +++
 lib/tests/Makefile               |   1 +
 lib/tests/barrier-timeout-test.c | 125 +++++++++++++++++++++++++++++++
 3 files changed, 136 insertions(+)
 create mode 100644 lib/tests/barrier-timeout-test.c

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 93f356d2b3d9..dcd2d60a9391 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2398,6 +2398,16 @@ config FPROBE_SANITY_TEST
 
 	  Say N if you are unsure.
 
+config BARRIER_TIMEOUT_TEST
+	tristate "KUnit tests for smp_cond_load_relaxed_timeout()"
+	depends on KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  Builds KUnit tests that validate wake-up and timeout handling paths
+	  in smp_cond_load_relaxed_timeout().
+
+	  Say N if you are unsure.
+
 config BACKTRACE_SELF_TEST
 	tristate "Self test for the backtrace code"
 	depends on DEBUG_KERNEL
diff --git a/lib/tests/Makefile b/lib/tests/Makefile
index 05f74edbc62b..3504d677b7b8 100644
--- a/lib/tests/Makefile
+++ b/lib/tests/Makefile
@@ -20,6 +20,7 @@ CFLAGS_fortify_kunit.o += $(DISABLE_STRUCTLEAK_PLUGIN)
 obj-$(CONFIG_FORTIFY_KUNIT_TEST) += fortify_kunit.o
 CFLAGS_test_fprobe.o += $(CC_FLAGS_FTRACE)
 obj-$(CONFIG_FPROBE_SANITY_TEST) += test_fprobe.o
+obj-$(CONFIG_BARRIER_TIMEOUT_TEST) += barrier-timeout-test.o
 obj-$(CONFIG_GLOB_KUNIT_TEST) += glob_kunit.o
 obj-$(CONFIG_HASHTABLE_KUNIT_TEST) += hashtable_test.o
 obj-$(CONFIG_HASH_KUNIT_TEST) += test_hash.o
diff --git a/lib/tests/barrier-timeout-test.c b/lib/tests/barrier-timeout-test.c
new file mode 100644
index 000000000000..d72200daa0f2
--- /dev/null
+++ b/lib/tests/barrier-timeout-test.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit tests exercising smp_cond_load_relaxed_timeout().
+ *
+ * Copyright (c) 2026, Oracle Corp.
+ * Author: Ankur Arora <ankur.a.arora@oracle.com>
+ */
+
+#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/sched/clock.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <asm/barrier.h>
+#include <kunit/test.h>
+#include <kunit/visibility.h>
+
+MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");
+
+struct clock_state {
+	s64 start_time;
+	s64 end_time;
+};
+
+#define TIMEOUT_MSEC	2
+#define TEST_FLAG_VAL	BIT(2)
+static unsigned int flag;
+
+static s64 basic_clock(struct clock_state *clk)
+{
+	clk->end_time = local_clock();
+	return clk->end_time;
+}
+
+static void update_flags(void)
+{
+	WRITE_ONCE(flag, TEST_FLAG_VAL);
+}
+
+static s64 mock_clock(struct clock_state *clk)
+{
+	s64 clk_mid = clk->start_time + (TIMEOUT_MSEC * NSEC_PER_MSEC)/2;
+
+	clk->end_time = local_clock();
+	if (clk->end_time >= clk_mid)
+		update_flags();
+	return clk->end_time;
+}
+
+typedef s64 (*clkfn_t)(struct clock_state *);
+
+static void test_smp_cond_relaxed_timeout(struct kunit *test,
+					  clkfn_t clock, bool succeeds)
+{
+	struct clock_state clk = {
+		.start_time = local_clock(),
+		.end_time = local_clock(),
+	};
+	s64 runtime, timeout_ns = TIMEOUT_MSEC * NSEC_PER_MSEC;
+	unsigned int result;
+
+	result = smp_cond_load_relaxed_timeout(&flag,
+					       (VAL & TEST_FLAG_VAL),
+					       clock(&clk),
+					       timeout_ns);
+
+	runtime = clk.end_time - clk.start_time;
+	KUNIT_EXPECT_EQ(test, (bool)(result & TEST_FLAG_VAL), succeeds);
+	KUNIT_EXPECT_EQ(test, runtime <= timeout_ns, succeeds);
+}
+
+static int smp_cond_threadfn(void *data)
+{
+	udelay(TIMEOUT_MSEC * USEC_PER_MSEC / 4);
+
+	/*
+	 * Update flags after a delay to give smp_cond_relaxed_timeout()
+	 * time to get started.
+	 */
+	update_flags();
+	return 0;
+}
+
+static void smp_cond_relaxed_timeout_succeeds(struct kunit *test)
+{
+	struct task_struct *task;
+
+	flag = 0;
+
+	task = kthread_run(smp_cond_threadfn, &flag, "smp_cond_thread");
+
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, task);
+	test_smp_cond_relaxed_timeout(test, &basic_clock, true);
+
+	kthread_stop(task);
+}
+
+static void smp_cond_relaxed_timeout_mocked(struct kunit *test)
+{
+	flag = 0;
+	test_smp_cond_relaxed_timeout(test, &mock_clock, true);
+}
+
+static void smp_cond_relaxed_timeout_expires(struct kunit *test)
+{
+	flag = 0;
+	test_smp_cond_relaxed_timeout(test, &basic_clock, false);
+}
+
+static struct kunit_case barrier_timeout_test_cases[] = {
+	KUNIT_CASE(smp_cond_relaxed_timeout_mocked),
+	KUNIT_CASE(smp_cond_relaxed_timeout_succeeds),
+	KUNIT_CASE(smp_cond_relaxed_timeout_expires),
+	{}
+};
+
+static struct kunit_suite barrier_timeout_test_suite = {
+	.name = "smp-cond-load-relaxed-timeout",
+	.test_cases = barrier_timeout_test_cases,
+};
+
+kunit_test_suite(barrier_timeout_test_suite);
+
+MODULE_DESCRIPTION("KUnit tests for smp_cond_load_relaxed_timeout()");
+MODULE_LICENSE("GPL");
-- 
2.31.1



* [PATCH v11 13/14] kunit: enable testing smp_cond_load_relaxed_timeout()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

This enables the barrier tests to be built as a module.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/lib/delay.c               | 2 ++
 drivers/clocksource/arm_arch_timer.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index c660a7ea26dd..dfb102ce3009 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -12,6 +12,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/timex.h>
+#include <kunit/visibility.h>
 #include <asm/delay-const.h>
 
 #include <clocksource/arm_arch_timer.h>
@@ -30,6 +31,7 @@ u64 notrace __delay_cycles(void)
 	guard(preempt_notrace)();
 	return __arch_counter_get_cntvct_stable();
 }
+EXPORT_SYMBOL_IF_KUNIT(__delay_cycles);
 
 void __delay(unsigned long cycles)
 {
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 90aeff44a276..1de63e1a2cd2 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -28,6 +28,7 @@
 #include <linux/acpi.h>
 #include <linux/arm-smccc.h>
 #include <linux/ptp_kvm.h>
+#include <kunit/visibility.h>
 
 #include <asm/arch_timer.h>
 #include <asm/virt.h>
@@ -896,6 +897,7 @@ bool arch_timer_evtstrm_available(void)
 	 */
 	return cpumask_test_cpu(raw_smp_processor_id(), &evtstrm_available);
 }
+EXPORT_SYMBOL_IF_KUNIT(arch_timer_evtstrm_available);
 
 static struct arch_timer_kvm_info arch_timer_kvm_info;
 
-- 
2.31.1



* [PATCH v11 12/14] cpuidle/poll_state: Wait for need-resched via tif_need_resched_relaxed_wait()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

The inner loop in poll_idle() polls over the thread_info flags,
waiting to see if the thread has TIF_NEED_RESCHED set. The loop
exits once the condition is met, or if the poll time limit has
been exceeded.

To minimize the number of instructions executed in each iteration,
the time check is rate-limited. In addition, each loop iteration
executes cpu_relax() which on certain platforms provides a hint to
the pipeline that the loop busy-waits, allowing the processor to
reduce power consumption.

Switch over to tif_need_resched_relaxed_wait() instead, since that
provides exactly that.
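
The rate-limited time check described above can be sketched as a
userspace model. This is an illustration only: demo_poll(),
demo_fake_clock() and struct demo_poll_stats are hypothetical names, the
clock is a fake counter that advances by one on every read so the
behaviour is deterministic, and the cpu_relax() hint is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors POLL_IDLE_RELAX_COUNT from the loop removed by this patch. */
#define DEMO_RELAX_COUNT	200

struct demo_poll_stats {
	uint64_t iterations;	/* loop iterations executed */
	uint64_t clock_reads;	/* clock reads made inside the loop */
};

/* Fake clock for the sketch: advances by one tick per read. */
static uint64_t demo_fake_clock(uint64_t *ticks)
{
	return ++*ticks;
}

/*
 * Poll *flag, reading the clock only once every DEMO_RELAX_COUNT + 1
 * iterations. Returns true if the flag was seen, false on timeout.
 */
static bool demo_poll(const volatile int *flag, uint64_t limit,
		      uint64_t *ticks, struct demo_poll_stats *st)
{
	uint64_t time_start = demo_fake_clock(ticks);
	unsigned int loop_count = 0;

	assert(ticks && st);
	while (!*flag) {
		st->iterations++;
		if (loop_count++ < DEMO_RELAX_COUNT)
			continue;

		loop_count = 0;
		st->clock_reads++;
		if (demo_fake_clock(ticks) - time_start > limit)
			return false;
	}
	return true;
}
```

In this model the flag is checked on every iteration while the clock is
consulted only occasionally, which is the trade-off the removed loop
made by hand.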

However, since we want to minimize power consumption in idle, building
of cpuidle/poll_state.c continues to depend on CONFIG_ARCH_HAS_CPU_RELAX
as that serves as an indicator that the platform supports an optimized
version of tif_need_resched_relaxed_wait() (via
smp_cond_load_relaxed_timeout()).

Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-pm@vger.kernel.org
Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/cpuidle/poll_state.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index c7524e4c522a..7443b3e971ba 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -6,41 +6,22 @@
 #include <linux/cpuidle.h>
 #include <linux/export.h>
 #include <linux/irqflags.h>
-#include <linux/sched.h>
-#include <linux/sched/clock.h>
 #include <linux/sched/idle.h>
 #include <linux/sprintf.h>
 #include <linux/types.h>
 
-#define POLL_IDLE_RELAX_COUNT	200
-
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
-	u64 time_start;
-
-	time_start = local_clock_noinstr();
-
 	dev->poll_time_limit = false;
 
 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		unsigned int loop_count = 0;
 		u64 limit;
 
 		limit = cpuidle_poll_time(drv, dev);
 
-		while (!need_resched()) {
-			cpu_relax();
-			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
-				continue;
-
-			loop_count = 0;
-			if (local_clock_noinstr() - time_start > limit) {
-				dev->poll_time_limit = true;
-				break;
-			}
-		}
+		dev->poll_time_limit = !tif_need_resched_relaxed_wait(limit);
 	}
 	raw_local_irq_disable();
 
-- 
2.31.1



* [PATCH v11 10/14] bpf/rqspinlock: Use smp_cond_load_acquire_timeout()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

Switch out the conditional load interfaces used by rqspinlock
to smp_cond_load_acquire_timeout() and its wrapper,
atomic_cond_read_acquire_timeout().

Both these handle the timeout and amortize as needed, so use the
non-amortized RES_CHECK_TIMEOUT.

RES_CHECK_TIMEOUT does double duty here: it presents the current
clock value, or the timeout/deadlock error from clock_deadlock(), to
the cond-load, and it returns the error value via ret.

For correctness, we need to ensure that the error case of the
cond-load interface always agrees with that in clock_deadlock().

For the most part, this is fine because there's no independent clock,
or double reads from the clock in cond-load -- either of which could
lead to its internal state going out of sync with that of
clock_deadlock().

There is, however, an edge case where clock_deadlock() checks for:

        if (time > ts->timeout_end)
                return -ETIMEDOUT;

while smp_cond_load_acquire_timeout() checks for:

        __time_now = (time_expr_ns);
        if (__time_now <= 0 || __time_now >= __time_end) {
                VAL = READ_ONCE(*__PTR);
                break;
        }

This runs into a problem when (__time_now == __time_end) since
clock_deadlock() does not treat it as a timeout condition but
the second clause in the conditional above does.
So, add an equality check in clock_deadlock().
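
The boundary disagreement can be checked with a few lines of standalone
C. This is only a sketch of the three comparisons discussed above; the
function names are hypothetical, and the times are plain integers rather
than values read from a clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Timeout test in clock_deadlock() before this patch. */
static bool deadline_passed_old(int64_t now, int64_t end)
{
	return now > end;
}

/* Timeout test in clock_deadlock() with the equality check added. */
static bool deadline_passed_new(int64_t now, int64_t end)
{
	return now >= end;
}

/* Exit test of the cond-load timeout loop, as quoted above. */
static bool cond_load_gives_up(int64_t now, int64_t end)
{
	return now <= 0 || now >= end;
}

/* At now == end only the new test agrees with the cond-load loop. */
static void demo_boundary(void)
{
	const int64_t end = 1000;

	assert(cond_load_gives_up(end, end));
	assert(!deadline_passed_old(end, end));
	assert(deadline_passed_new(end, end));
}
```

With the old comparison the cond-load would stop waiting at exactly
now == end while ret stayed 0, so the caller would see neither success
nor an error.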

Finally, redefine SMP_TIMEOUT_POLL_COUNT to be 16k to be similar to
the spin-count used in the amortized version. We only do this on
architectures other than arm64, since arm64 uses a waiting implementation.

Cc: bpf@vger.kernel.org
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/bpf/rqspinlock.c | 40 +++++++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/rqspinlock.c b/kernel/bpf/rqspinlock.c
index 0ec17ebb67c1..e5e27266b813 100644
--- a/kernel/bpf/rqspinlock.c
+++ b/kernel/bpf/rqspinlock.c
@@ -215,7 +215,7 @@ static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
 	}
 
 	time = ktime_get_mono_fast_ns();
-	if (time > ts->timeout_end)
+	if (time >= ts->timeout_end)
 		return -ETIMEDOUT;
 
 	/*
@@ -235,11 +235,10 @@ static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
 }
 
 /*
- * Do not amortize with spins when res_smp_cond_load_acquire is defined,
- * as the macro does internal amortization for us.
+ * Spin amortized version of RES_CHECK_TIMEOUT. Used when busy-waiting in
+ * atomic_try_cmpxchg().
  */
-#ifndef res_smp_cond_load_acquire
-#define RES_CHECK_TIMEOUT(ts, ret, mask)					\
+#define RES_CHECK_TIMEOUT_AMORTIZED(ts, ret, mask)				\
 	({									\
 		s64 __timeval_err = 0;						\
 		if (!(ts).spin++)						\
@@ -247,7 +246,7 @@ static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
 		(ret) = __timeval_err < 0 ? __timeval_err : 0;			\
 		__timeval_err;							\
 	})
-#else
+
 #define RES_CHECK_TIMEOUT(ts, ret, mask)					\
 	({									\
 		s64 __timeval_err;						\
@@ -255,7 +254,6 @@ static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
 		(ret) = __timeval_err < 0 ? __timeval_err : 0;			\
 		__timeval_err;							\
 	})
-#endif
 
 /*
  * Initialize the 'spin' member.
@@ -269,6 +267,17 @@ static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
  */
 #define RES_RESET_TIMEOUT(ts, _duration) ({ (ts).timeout_end = 0; (ts).duration = _duration; })
 
+/*
+ * Limit how often we invoke clock_deadlock() while spin-waiting in
+ * smp_cond_load_acquire_timeout() or atomic_cond_read_acquire_timeout().
+ *
+ * We only override the default value, without superseding ARM64's override.
+ */
+#ifndef CONFIG_ARM64
+#undef SMP_TIMEOUT_POLL_COUNT
+#define SMP_TIMEOUT_POLL_COUNT	(16*1024)
+#endif
+
 /*
  * Provide a test-and-set fallback for cases when queued spin lock support is
  * absent from the architecture.
@@ -296,7 +305,7 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock)
 	val = atomic_read(&lock->val);
 
 	if (val || !atomic_try_cmpxchg(&lock->val, &val, 1)) {
-		if (RES_CHECK_TIMEOUT(ts, ret, ~0u) < 0)
+		if (RES_CHECK_TIMEOUT_AMORTIZED(ts, ret, ~0u) < 0)
 			goto out;
 		cpu_relax();
 		goto retry;
@@ -319,12 +328,6 @@ EXPORT_SYMBOL_GPL(resilient_tas_spin_lock);
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, rqnodes[_Q_MAX_NODES]);
 
-#ifndef res_smp_cond_load_acquire
-#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire(v, c)
-#endif
-
-#define res_atomic_cond_read_acquire(v, c) res_smp_cond_load_acquire(&(v)->counter, (c))
-
 /**
  * resilient_queued_spin_lock_slowpath - acquire the queued spinlock
  * @lock: Pointer to queued spinlock structure
@@ -421,7 +424,9 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 */
 	if (val & _Q_LOCKED_MASK) {
 		RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
-		res_smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK) < 0);
+		smp_cond_load_acquire_timeout(&lock->locked, !VAL,
+					      RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK),
+					      ts.duration);
 	}
 
 	if (ret) {
@@ -582,8 +587,9 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 * us.
 	 */
 	RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT * 2);
-	val = res_atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
-					   RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK) < 0);
+	val = atomic_cond_read_acquire_timeout(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK),
+					       RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK),
+					       ts.duration);
 
 	/* Disable queue destruction when we detect deadlocks. */
 	if (ret == -EDEADLK) {
-- 
2.31.1



* [PATCH v11 08/14] locking/atomic: scripts: build atomic_long_cond_read_*_timeout()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora, Boqun Feng
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

Add the atomic long wrappers for the cond-load timeout interfaces.

Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/atomic/atomic-long.h | 18 +++++++++++-------
 scripts/atomic/gen-atomic-long.sh  | 16 ++++++++++------
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/include/linux/atomic/atomic-long.h b/include/linux/atomic/atomic-long.h
index 6a4e47d2db35..553b6b0e0258 100644
--- a/include/linux/atomic/atomic-long.h
+++ b/include/linux/atomic/atomic-long.h
@@ -11,14 +11,18 @@
 
 #ifdef CONFIG_64BIT
 typedef atomic64_t atomic_long_t;
-#define ATOMIC_LONG_INIT(i)		ATOMIC64_INIT(i)
-#define atomic_long_cond_read_acquire	atomic64_cond_read_acquire
-#define atomic_long_cond_read_relaxed	atomic64_cond_read_relaxed
+#define ATOMIC_LONG_INIT(i)			ATOMIC64_INIT(i)
+#define atomic_long_cond_read_acquire		atomic64_cond_read_acquire
+#define atomic_long_cond_read_relaxed		atomic64_cond_read_relaxed
+#define atomic_long_cond_read_acquire_timeout	atomic64_cond_read_acquire_timeout
+#define atomic_long_cond_read_relaxed_timeout	atomic64_cond_read_relaxed_timeout
 #else
 typedef atomic_t atomic_long_t;
-#define ATOMIC_LONG_INIT(i)		ATOMIC_INIT(i)
-#define atomic_long_cond_read_acquire	atomic_cond_read_acquire
-#define atomic_long_cond_read_relaxed	atomic_cond_read_relaxed
+#define ATOMIC_LONG_INIT(i)			ATOMIC_INIT(i)
+#define atomic_long_cond_read_acquire		atomic_cond_read_acquire
+#define atomic_long_cond_read_relaxed		atomic_cond_read_relaxed
+#define atomic_long_cond_read_acquire_timeout	atomic_cond_read_acquire_timeout
+#define atomic_long_cond_read_relaxed_timeout	atomic_cond_read_relaxed_timeout
 #endif
 
 /**
@@ -1809,4 +1813,4 @@ raw_atomic_long_dec_if_positive(atomic_long_t *v)
 }
 
 #endif /* _LINUX_ATOMIC_LONG_H */
-// 4b882bf19018602c10816c52f8b4ae280adc887b
+// 79c1f4acb5774376ceed559843d5d9ed1348df99
diff --git a/scripts/atomic/gen-atomic-long.sh b/scripts/atomic/gen-atomic-long.sh
index 9826be3ba986..874643dc74bd 100755
--- a/scripts/atomic/gen-atomic-long.sh
+++ b/scripts/atomic/gen-atomic-long.sh
@@ -79,14 +79,18 @@ cat << EOF
 
 #ifdef CONFIG_64BIT
 typedef atomic64_t atomic_long_t;
-#define ATOMIC_LONG_INIT(i)		ATOMIC64_INIT(i)
-#define atomic_long_cond_read_acquire	atomic64_cond_read_acquire
-#define atomic_long_cond_read_relaxed	atomic64_cond_read_relaxed
+#define ATOMIC_LONG_INIT(i)			ATOMIC64_INIT(i)
+#define atomic_long_cond_read_acquire		atomic64_cond_read_acquire
+#define atomic_long_cond_read_relaxed		atomic64_cond_read_relaxed
+#define atomic_long_cond_read_acquire_timeout	atomic64_cond_read_acquire_timeout
+#define atomic_long_cond_read_relaxed_timeout	atomic64_cond_read_relaxed_timeout
 #else
 typedef atomic_t atomic_long_t;
-#define ATOMIC_LONG_INIT(i)		ATOMIC_INIT(i)
-#define atomic_long_cond_read_acquire	atomic_cond_read_acquire
-#define atomic_long_cond_read_relaxed	atomic_cond_read_relaxed
+#define ATOMIC_LONG_INIT(i)			ATOMIC_INIT(i)
+#define atomic_long_cond_read_acquire		atomic_cond_read_acquire
+#define atomic_long_cond_read_relaxed		atomic_cond_read_relaxed
+#define atomic_long_cond_read_acquire_timeout	atomic_cond_read_acquire_timeout
+#define atomic_long_cond_read_relaxed_timeout	atomic_cond_read_relaxed_timeout
 #endif
 
 EOF
-- 
2.31.1



* [PATCH v11 07/14] atomic: Add atomic_cond_read_*_timeout()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora, Boqun Feng
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

Add atomic load wrappers, atomic_cond_read_*_timeout() and
atomic64_cond_read_*_timeout() for the cond-load timeout interfaces.

Also add a short description for the atomic_cond_read_{relaxed,acquire}(),
and the atomic_cond_read_{relaxed,acquire}_timeout() interfaces.

Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 Documentation/atomic_t.txt | 14 +++++++++-----
 include/linux/atomic.h     | 10 ++++++++++
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/Documentation/atomic_t.txt b/Documentation/atomic_t.txt
index bee3b1bca9a7..0e53f6ccb558 100644
--- a/Documentation/atomic_t.txt
+++ b/Documentation/atomic_t.txt
@@ -16,6 +16,10 @@ Non-RMW ops:
   atomic_read(), atomic_set()
   atomic_read_acquire(), atomic_set_release()
 
+Non-RMW, non-atomic_t ops:
+
+  atomic_cond_read_{relaxed,acquire}()
+  atomic_cond_read_{relaxed,acquire}_timeout()
 
 RMW atomic operations:
 
@@ -79,11 +83,11 @@ SEMANTICS
 
 Non-RMW ops:
 
-The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
-implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
-smp_store_release() respectively. Therefore, if you find yourself only using
-the Non-RMW operations of atomic_t, you do not in fact need atomic_t at all
-and are doing it wrong.
+The non-RMW ops are (typically) regular, or conditional LOADs and STOREs and
+are canonically implemented using READ_ONCE(), WRITE_ONCE(),
+smp_load_acquire() and smp_store_release() respectively. Therefore, if you
+find yourself only using the Non-RMW operations of atomic_t, you do not in
+fact need atomic_t at all and are doing it wrong.
 
 A note for the implementation of atomic_set{}() is that it must not break the
 atomicity of the RMW ops. That is:
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 8dd57c3a99e9..5bcb86e07784 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -31,6 +31,16 @@
 #define atomic64_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
 #define atomic64_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
 
+#define atomic_cond_read_acquire_timeout(v, c, e, t) \
+	smp_cond_load_acquire_timeout(&(v)->counter, (c), (e), (t))
+#define atomic_cond_read_relaxed_timeout(v, c, e, t) \
+	smp_cond_load_relaxed_timeout(&(v)->counter, (c), (e), (t))
+
+#define atomic64_cond_read_acquire_timeout(v, c, e, t) \
+	smp_cond_load_acquire_timeout(&(v)->counter, (c), (e), (t))
+#define atomic64_cond_read_relaxed_timeout(v, c, e, t) \
+	smp_cond_load_relaxed_timeout(&(v)->counter, (c), (e), (t))
+
 /*
  * The idea here is to build acquire/release variants by adding explicit
  * barriers on top of the relaxed variant. In the case where the relaxed
-- 
2.31.1



* [PATCH v11 05/14] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

In preparation for defining smp_cond_load_acquire_timeout(), remove
the private copy of smp_cond_load_acquire_timewait(). Without it, the
rqspinlock code falls back to using smp_cond_load_acquire().

Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/rqspinlock.h | 85 -----------------------------
 1 file changed, 85 deletions(-)

diff --git a/arch/arm64/include/asm/rqspinlock.h b/arch/arm64/include/asm/rqspinlock.h
index 9ea0a74e5892..a385603436e9 100644
--- a/arch/arm64/include/asm/rqspinlock.h
+++ b/arch/arm64/include/asm/rqspinlock.h
@@ -3,91 +3,6 @@
 #define _ASM_RQSPINLOCK_H
 
 #include <asm/barrier.h>
-
-/*
- * Hardcode res_smp_cond_load_acquire implementations for arm64 to a custom
- * version based on [0]. In rqspinlock code, our conditional expression involves
- * checking the value _and_ additionally a timeout. However, on arm64, the
- * WFE-based implementation may never spin again if no stores occur to the
- * locked byte in the lock word. As such, we may be stuck forever if
- * event-stream based unblocking is not available on the platform for WFE spin
- * loops (arch_timer_evtstrm_available).
- *
- * Once support for smp_cond_load_acquire_timewait [0] lands, we can drop this
- * copy-paste.
- *
- * While we rely on the implementation to amortize the cost of sampling
- * cond_expr for us, it will not happen when event stream support is
- * unavailable, time_expr check is amortized. This is not the common case, and
- * it would be difficult to fit our logic in the time_expr_ns >= time_limit_ns
- * comparison, hence just let it be. In case of event-stream, the loop is woken
- * up at microsecond granularity.
- *
- * [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
- */
-
-#ifndef smp_cond_load_acquire_timewait
-
-#define smp_cond_time_check_count	200
-
-#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns,	\
-					 time_limit_ns) ({		\
-	typeof(ptr) __PTR = (ptr);					\
-	__unqual_scalar_typeof(*ptr) VAL;				\
-	unsigned int __count = 0;					\
-	for (;;) {							\
-		VAL = READ_ONCE(*__PTR);				\
-		if (cond_expr)						\
-			break;						\
-		cpu_relax();						\
-		if (__count++ < smp_cond_time_check_count)		\
-			continue;					\
-		if ((time_expr_ns) >= (time_limit_ns))			\
-			break;						\
-		__count = 0;						\
-	}								\
-	(typeof(*ptr))VAL;						\
-})
-
-#define __smp_cond_load_acquire_timewait(ptr, cond_expr,		\
-					 time_expr_ns, time_limit_ns)	\
-({									\
-	typeof(ptr) __PTR = (ptr);					\
-	__unqual_scalar_typeof(*ptr) VAL;				\
-	for (;;) {							\
-		VAL = smp_load_acquire(__PTR);				\
-		if (cond_expr)						\
-			break;						\
-		__cmpwait_relaxed(__PTR, VAL);				\
-		if ((time_expr_ns) >= (time_limit_ns))			\
-			break;						\
-	}								\
-	(typeof(*ptr))VAL;						\
-})
-
-#define smp_cond_load_acquire_timewait(ptr, cond_expr,			\
-				      time_expr_ns, time_limit_ns)	\
-({									\
-	__unqual_scalar_typeof(*ptr) _val;				\
-	int __wfe = arch_timer_evtstrm_available();			\
-									\
-	if (likely(__wfe)) {						\
-		_val = __smp_cond_load_acquire_timewait(ptr, cond_expr,	\
-							time_expr_ns,	\
-							time_limit_ns);	\
-	} else {							\
-		_val = __smp_cond_load_relaxed_spinwait(ptr, cond_expr,	\
-							time_expr_ns,	\
-							time_limit_ns);	\
-		smp_acquire__after_ctrl_dep();				\
-	}								\
-	(typeof(*ptr))_val;						\
-})
-
-#endif
-
-#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire_timewait(v, c, 0, 1)
-
 #include <asm-generic/rqspinlock.h>
 
 #endif /* _ASM_RQSPINLOCK_H */
-- 
2.31.1



* [PATCH v11 06/14] asm-generic: barrier: Add smp_cond_load_acquire_timeout()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

Add the acquire variant of smp_cond_load_relaxed_timeout(). This
reuses the relaxed variant, with additional LOAD->LOAD ordering.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/barrier.h | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
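
For readers less familiar with the "relaxed wait, then upgrade to
ACQUIRE" pattern, here is a minimal userspace C11 sketch of the idea.
It is an analogue, not the kernel code: atomic_thread_fence() stands in
for smp_acquire__after_ctrl_dep(), a bounded spin count stands in for
the timeout, and wait_nonzero_relaxed(), wait_nonzero_acquire(),
struct mailbox, publish() and consume() are all hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Relaxed bounded wait. Hypothetical stand-in for
 * smp_cond_load_relaxed_timeout(): performs at most max_spins + 1 loads
 * and returns the last value read (0 means the bound was hit).
 */
static int wait_nonzero_relaxed(atomic_int *ptr, unsigned long max_spins)
{
	int val;

	do {
		val = atomic_load_explicit(ptr, memory_order_relaxed);
	} while (!val && max_spins--);

	return val;
}

/* ACQUIRE variant: reuse the relaxed wait, then add the ordering. */
static int wait_nonzero_acquire(atomic_int *ptr, unsigned long max_spins)
{
	int val = wait_nonzero_relaxed(ptr, max_spins);

	/* Order the load above before all later loads and stores. */
	atomic_thread_fence(memory_order_acquire);
	return val;
}

/* Hypothetical message-passing user of the acquire variant. */
struct mailbox {
	int payload;
	atomic_int ready;
};

static void publish(struct mailbox *mb, int value)
{
	mb->payload = value;
	/* Pairs with the acquire fence in wait_nonzero_acquire(). */
	atomic_store_explicit(&mb->ready, 1, memory_order_release);
}

/* Returns 0 and fills *out on success, -1 if the wait gave up. */
static int consume(struct mailbox *mb, unsigned long max_spins, int *out)
{
	assert(out);

	if (!wait_nonzero_acquire(&mb->ready, max_spins))
		return -1;

	*out = mb->payload;	/* safe: ordered after the flag load */
	return 0;
}
```

As in the patch, the caller has to look at the returned value to tell
"condition met" from "gave up"; the wait itself does not report which
one happened.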

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index e5a6a1c04649..68d9e7108f4a 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -342,6 +342,32 @@ do {									\
 })
 #endif
 
+/**
+ * smp_cond_load_acquire_timeout() - (Spin) wait for cond with ACQUIRE ordering
+ * until a timeout expires.
+ * @ptr: pointer to the variable to wait on.
+ * @cond_expr: boolean expression to wait for.
+ * @time_expr_ns: monotonic expression that evaluates to time in ns or,
+ *  on failure, returns a negative value.
+ * @timeout_ns: timeout value in ns
+ * (Both of the above are assumed to be compatible with s64.)
+ *
+ * Equivalent to using smp_cond_load_acquire() on the condition variable with
+ * a timeout.
+ */
+#ifndef smp_cond_load_acquire_timeout
+#define smp_cond_load_acquire_timeout(ptr, cond_expr,			\
+				      time_expr_ns, timeout_ns)		\
+({									\
+	__unqual_scalar_typeof(*ptr) _val;				\
+	_val = smp_cond_load_relaxed_timeout(ptr, cond_expr,		\
+					     time_expr_ns,		\
+					     timeout_ns);		\
+	smp_acquire__after_ctrl_dep();					\
+	(typeof(*ptr))_val;						\
+})
+#endif
+
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have
-- 
2.31.1



* [PATCH v11 04/14] arm64: support WFET in smp_cond_load_relaxed_timeout()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

To handle WFET, use __cmpwait_timeout() similarly to __cmpwait(). These
call out to the respective __cmpwait_case_timeout_##sz() and
__cmpwait_case_##sz() functions.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/barrier.h |  8 +++--
 arch/arm64/include/asm/cmpxchg.h | 62 +++++++++++++++++++++++++-------
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6190e178db51..fbd71cd4ef4e 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -224,8 +224,8 @@ do {									\
 extern bool arch_timer_evtstrm_available(void);
 
 /*
- * In the common case, cpu_poll_relax() sits waiting in __cmpwait_relaxed()
- * for the ptr value to change.
+ * In the common case, cpu_poll_relax() sits waiting in __cmpwait_relaxed()/
+ * __cmpwait_relaxed_timeout() for the ptr value to change.
  *
  * Since this period is reasonably long, choose SMP_TIMEOUT_POLL_COUNT
  * to be 1, so smp_cond_load_{relaxed,acquire}_timeout() does a
@@ -234,7 +234,9 @@ extern bool arch_timer_evtstrm_available(void);
 #define SMP_TIMEOUT_POLL_COUNT	1
 
 #define cpu_poll_relax(ptr, val, timeout_ns) do {			\
-	if (arch_timer_evtstrm_available())				\
+	if (alternative_has_cap_unlikely(ARM64_HAS_WFXT))		\
+		__cmpwait_relaxed_timeout(ptr, val, timeout_ns);	\
+	else if (arch_timer_evtstrm_available())			\
 		__cmpwait_relaxed(ptr, val);				\
 	else								\
 		cpu_relax();						\
diff --git a/arch/arm64/include/asm/cmpxchg.h b/arch/arm64/include/asm/cmpxchg.h
index 6cf3cd6873f5..9e4cdc9e41d1 100644
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -12,6 +12,7 @@
 
 #include <asm/barrier.h>
 #include <asm/lse.h>
+#include <asm/delay-const.h>
 
 /*
  * We need separate acquire parameters for ll/sc and lse, since the full
@@ -212,7 +213,8 @@ __CMPXCHG_GEN(_mb)
 
 #define __CMPWAIT_CASE(w, sfx, sz)					\
 static inline void __cmpwait_case_##sz(volatile void *ptr,		\
-				       unsigned long val)		\
+				       unsigned long val,		\
+				       u64 __maybe_unused timeout_ns)	\
 {									\
 	unsigned long tmp;						\
 									\
@@ -235,20 +237,52 @@ __CMPWAIT_CASE( ,  , 64);
 
 #undef __CMPWAIT_CASE
 
-#define __CMPWAIT_GEN(sfx)						\
-static __always_inline void __cmpwait##sfx(volatile void *ptr,		\
-				  unsigned long val,			\
-				  int size)				\
+#define __CMPWAIT_TIMEOUT_CASE(w, sfx, sz)				\
+static inline void __cmpwait_case_timeout_##sz(volatile void *ptr,	\
+					       unsigned long val,	\
+					       u64 timeout_ns)		\
+{									\
+	unsigned long tmp;						\
+	u64 ecycles = __delay_cycles() +				\
+			NSECS_TO_CYCLES(timeout_ns);			\
+	asm volatile(							\
+	"	sevl\n"							\
+	"	wfe\n"							\
+	"	ldxr" #sfx "\t%" #w "[tmp], %[v]\n"			\
+	"	eor	%" #w "[tmp], %" #w "[tmp], %" #w "[val]\n"	\
+	"	cbnz	%" #w "[tmp], 2f\n"				\
+	"	msr s0_3_c1_c0_0, %[ecycles]\n"				\
+	"2:"								\
+	: [tmp] "=&r" (tmp), [v] "+Q" (*(u##sz *)ptr)			\
+	: [val] "r" (val), [ecycles] "r" (ecycles));			\
+}
+
+__CMPWAIT_TIMEOUT_CASE(w, b, 8);
+__CMPWAIT_TIMEOUT_CASE(w, h, 16);
+__CMPWAIT_TIMEOUT_CASE(w,  , 32);
+__CMPWAIT_TIMEOUT_CASE( ,  , 64);
+
+#undef __CMPWAIT_TIMEOUT_CASE
+
+#define __CMPWAIT_GEN(timeout, sfx)					\
+static __always_inline void __cmpwait##timeout##sfx(volatile void *ptr,	\
+						    unsigned long val,	\
+						    u64 timeout_ns,	\
+						    int size)		\
 {									\
 	switch (size) {							\
 	case 1:								\
-		return __cmpwait_case##sfx##_8(ptr, (u8)val);		\
+		return __cmpwait_case##timeout##sfx##_8(ptr, (u8)val,	\
+							timeout_ns);	\
 	case 2:								\
-		return __cmpwait_case##sfx##_16(ptr, (u16)val);		\
+		return __cmpwait_case##timeout##sfx##_16(ptr, (u16)val,	\
+							 timeout_ns);	\
 	case 4:								\
-		return __cmpwait_case##sfx##_32(ptr, val);		\
+		return __cmpwait_case##timeout##sfx##_32(ptr, val,	\
+							 timeout_ns);	\
 	case 8:								\
-		return __cmpwait_case##sfx##_64(ptr, val);		\
+		return __cmpwait_case##timeout##sfx##_64(ptr, val,	\
+							 timeout_ns);	\
 	default:							\
 		BUILD_BUG();						\
 	}								\
@@ -256,11 +290,15 @@ static __always_inline void __cmpwait##sfx(volatile void *ptr,		\
 	unreachable();							\
 }
 
-__CMPWAIT_GEN()
+__CMPWAIT_GEN(        , )
+__CMPWAIT_GEN(_timeout, )
 
 #undef __CMPWAIT_GEN
 
-#define __cmpwait_relaxed(ptr, val) \
-	__cmpwait((ptr), (unsigned long)(val), sizeof(*(ptr)))
+#define __cmpwait_relaxed_timeout(ptr, val, timeout_ns)			\
+	__cmpwait_timeout((ptr), (unsigned long)(val), timeout_ns, sizeof(*(ptr)))
+
+#define __cmpwait_relaxed(ptr, val)					\
+	__cmpwait((ptr), (unsigned long)(val), 0, sizeof(*(ptr)))
 
 #endif	/* __ASM_CMPXCHG_H */
-- 
2.31.1



* [PATCH v11 03/14] arm64/delay: move some constants out to a separate header
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora, Bjorn Andersson,
	Konrad Dybcio, Christoph Lameter
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

Move some constants and functions related to xloops and cycles
computation out to a new header. Also make __delay_cycles() available
outside of arch/arm64/lib/delay.c.

Rename some macros in qcom/rpmh-rsc.c which were occupying the same
namespace.

No functional change.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Konrad Dybcio <konradybcio@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/delay-const.h | 27 +++++++++++++++++++++++++++
 arch/arm64/lib/delay.c               | 15 ++++-----------
 drivers/soc/qcom/rpmh-rsc.c          |  8 ++++----
 3 files changed, 35 insertions(+), 15 deletions(-)
 create mode 100644 arch/arm64/include/asm/delay-const.h
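
A small worked check of the two multipliers and of the fixed-point
conversion, in case it helps review. This is a standalone userspace
sketch, not kernel code: xloops_mult(), xloops_to_ticks(),
usecs_to_ticks() and nsecs_to_ticks() are hypothetical names, and
ticks_per_sec is an assumed stand-in for loops_per_jiffy * HZ (taken
here to approximate the counter frequency).

```c
#include <assert.h>
#include <stdint.h>

/* Same values as __usecs_to_xloops_mult and __nsecs_to_xloops_mult. */
#define USECS_TO_XLOOPS_MULT	0x10C7ULL
#define NSECS_TO_XLOOPS_MULT	0x5ULL

/* ceil(2^32 / units_per_sec): how the "rounded up" constants arise. */
static uint64_t xloops_mult(uint64_t units_per_sec)
{
	return ((1ULL << 32) + units_per_sec - 1) / units_per_sec;
}

/*
 * Fixed-point conversion: xloops is time scaled by 2^32 / units_per_sec,
 * so multiplying by the tick rate and shifting right by 32 yields ticks.
 */
static uint64_t xloops_to_ticks(uint64_t xloops, uint64_t ticks_per_sec)
{
	/* The sketch does not handle overflow of the 64-bit product. */
	assert(ticks_per_sec == 0 || xloops <= UINT64_MAX / ticks_per_sec);

	return (xloops * ticks_per_sec) >> 32;
}

static uint64_t usecs_to_ticks(uint64_t usecs, uint64_t ticks_per_sec)
{
	return xloops_to_ticks(usecs * USECS_TO_XLOOPS_MULT, ticks_per_sec);
}

static uint64_t nsecs_to_ticks(uint64_t nsecs, uint64_t ticks_per_sec)
{
	return xloops_to_ticks(nsecs * NSECS_TO_XLOOPS_MULT, ticks_per_sec);
}
```

With an assumed 1 GHz tick rate, usecs_to_ticks(10, 1000000000) gives
10000, while nsecs_to_ticks(1000, 1000000000) gives 1164 rather than
1000: rounding 2**32 / 10**9 (about 4.29) up to 5 overestimates by
roughly 16%, so the ns conversion errs on the side of a longer wait.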

diff --git a/arch/arm64/include/asm/delay-const.h b/arch/arm64/include/asm/delay-const.h
new file mode 100644
index 000000000000..cb3988ff4e41
--- /dev/null
+++ b/arch/arm64/include/asm/delay-const.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_DELAY_CONST_H
+#define _ASM_DELAY_CONST_H
+
+#include <asm/param.h>	/* For HZ */
+
+/* 2**32 / 1000000 (rounded up) */
+#define __usecs_to_xloops_mult	0x10C7UL
+
+/* 2**32 / 1000000000 (rounded up) */
+#define __nsecs_to_xloops_mult	0x5UL
+
+extern unsigned long loops_per_jiffy;
+static inline unsigned long xloops_to_cycles(unsigned long xloops)
+{
+	return (xloops * loops_per_jiffy * HZ) >> 32;
+}
+
+#define USECS_TO_CYCLES(time_usecs) \
+	xloops_to_cycles((time_usecs) * __usecs_to_xloops_mult)
+
+#define NSECS_TO_CYCLES(time_nsecs) \
+	xloops_to_cycles((time_nsecs) * __nsecs_to_xloops_mult)
+
+u64 notrace __delay_cycles(void);
+
+#endif	/* _ASM_DELAY_CONST_H */
diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index e278e060e78a..c660a7ea26dd 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -12,17 +12,10 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/timex.h>
+#include <asm/delay-const.h>
 
 #include <clocksource/arm_arch_timer.h>
 
-#define USECS_TO_CYCLES(time_usecs)			\
-	xloops_to_cycles((time_usecs) * 0x10C7UL)
-
-static inline unsigned long xloops_to_cycles(unsigned long xloops)
-{
-	return (xloops * loops_per_jiffy * HZ) >> 32;
-}
-
 /*
  * Force the use of CNTVCT_EL0 in order to have the same base as WFxT.
  * This avoids some annoying issues when CNTVOFF_EL2 is not reset 0 on a
@@ -32,7 +25,7 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops)
  * Note that userspace cannot change the offset behind our back either,
  * as the vcpu mutex is held as long as KVM_RUN is in progress.
  */
-static cycles_t notrace __delay_cycles(void)
+u64 notrace __delay_cycles(void)
 {
 	guard(preempt_notrace)();
 	return __arch_counter_get_cntvct_stable();
@@ -73,12 +66,12 @@ EXPORT_SYMBOL(__const_udelay);
 
 void __udelay(unsigned long usecs)
 {
-	__const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
+	__const_udelay(usecs * __usecs_to_xloops_mult);
 }
 EXPORT_SYMBOL(__udelay);
 
 void __ndelay(unsigned long nsecs)
 {
-	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
+	__const_udelay(nsecs * __nsecs_to_xloops_mult);
 }
 EXPORT_SYMBOL(__ndelay);
diff --git a/drivers/soc/qcom/rpmh-rsc.c b/drivers/soc/qcom/rpmh-rsc.c
index c6f7d5c9c493..ad5ec5c0de0a 100644
--- a/drivers/soc/qcom/rpmh-rsc.c
+++ b/drivers/soc/qcom/rpmh-rsc.c
@@ -146,10 +146,10 @@ enum {
  *  +---------------------------------------------------+
  */
 
-#define USECS_TO_CYCLES(time_usecs)			\
-	xloops_to_cycles((time_usecs) * 0x10C7UL)
+#define RPMH_USECS_TO_CYCLES(time_usecs)		\
+	rpmh_xloops_to_cycles((time_usecs) * 0x10C7UL)
 
-static inline unsigned long xloops_to_cycles(u64 xloops)
+static inline unsigned long rpmh_xloops_to_cycles(u64 xloops)
 {
 	return (xloops * loops_per_jiffy * HZ) >> 32;
 }
@@ -819,7 +819,7 @@ void rpmh_rsc_write_next_wakeup(struct rsc_drv *drv)
 	wakeup_us = ktime_to_us(wakeup);
 
 	/* Convert the wakeup to arch timer scale */
-	wakeup_cycles = USECS_TO_CYCLES(wakeup_us);
+	wakeup_cycles = RPMH_USECS_TO_CYCLES(wakeup_us);
 	wakeup_cycles += arch_timer_read_counter();
 
 exit:
-- 
2.31.1



* [PATCH v11 00/14] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora

Hi,

Main change in this version:
  - adds a kunit validation test.

What remains:
  - Review by PeterZ of the new interface tif_need_resched_relaxed_wait()
    (patch 11, "sched: add need-resched timed wait interface").
    (Peter had originally proposed using smp_cond_load_relaxed() in
     poll_idle() [11]).

The core kernel often uses smp_cond_load_{relaxed,acquire}() to spin
on condition variables with architectural primitives used to avoid
hammering the relevant cachelines.

(This primitive can vary greatly across architectures: on x86 it's a
cpu_relax() to slow down the pipeline. On arm64, this is a __cmpwait()
which waits for a cacheline to change state in a time limited fashion.)

Regardless of architectural details, typical smp_cond_load*() usage
does not allow for termination until the condition change occurs.

Beyond the core kernel, there are cases where it is useful to additionally
terminate on a timeout. Two cases:

  - cpuidle poll_idle(): wait for need-resched until the cpuidle polling
    duration expires.

  - rqspinlock: nested qspinlock acquisition that terminates on timeout
    or deadlock.

Accordingly add two interfaces (with their generic and arm64 specific
implementations):

   smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr, timeout)
   smp_cond_load_acquire_timeout(ptr, cond_expr, time_expr, timeout)

Also add tif_need_resched_relaxed_wait() which wraps the polling
pattern and its scheduler specific details in poll_idle().
In addition add atomic_cond_read_*_timeout(),
atomic64_cond_read_*_timeout(), and atomic_long wrappers.

Structurally, both the smp_cond_load_*_timeout() interfaces are similar
to smp_cond_load*(), with the addition of a rate-limited time-check.
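
To make "rate-limited time-check" concrete, here is a minimal userspace
C11 sketch of the loop shape. It is an illustration under stated
assumptions, not the kernel implementation: POLL_COUNT is a
hypothetical stand-in for SMP_TIMEOUT_POLL_COUNT, now_ns() plays the
role of time_expr, and wait_equal_relaxed_timeout() is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical: number of polls between two clock reads. */
#define POLL_COUNT	200

/* Monotonic time in ns; plays the role of time_expr. */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Spin until *ptr == want or until roughly timeout_ns has elapsed.
 * Returns the last value read; the caller re-checks the condition to
 * tell success from timeout.
 */
static int wait_equal_relaxed_timeout(atomic_int *ptr, int want,
				      int64_t timeout_ns)
{
	int64_t deadline = -1;	/* set lazily, on the first slow-path hit */
	unsigned int spins = 0;
	int64_t now;
	int val;

	assert(timeout_ns >= 0);

	for (;;) {
		val = atomic_load_explicit(ptr, memory_order_relaxed);
		if (val == want)
			break;

		/* The kernel would call cpu_poll_relax() at this point. */
		if (++spins < POLL_COUNT)
			continue;	/* fast path: no clock read */
		spins = 0;

		/* Slow path: read the clock once per POLL_COUNT polls. */
		now = now_ns();
		if (deadline < 0)
			deadline = now + timeout_ns;
		if (now >= deadline) {
			/* Re-read the target before giving up. */
			val = atomic_load_explicit(ptr, memory_order_relaxed);
			break;
		}
	}

	return val;
}
```

The clock is read only on the slow path, so its cost is amortized over
POLL_COUNT polls, at the price of overshooting the deadline by up to
one batch of polls.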

Usage
==

These interfaces drop straightforwardly into the rqspinlock logic
since qspinlock already uses smp_cond_load*(), and the time-check
extension can now be used for timeout and deadlock handling.

Using tif_need_resched_relaxed_wait() in poll_idle() removes any
architectural details, allowing arm64 to support that path
straightforwardly.
(However, for efficiency reasons cpuidle/poll_state.c continues to
depend on ARCH_HAS_CPU_RELAX since that is defined on architectures
with an optimized architectural primitive.)


Performance
==

Apart from simplifications due to this change, supporting polling in
cpuidle on arm64 helps improve wakeup latency (needs a few cpuidle/acpi
patches):


  # perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
  perf bench sched pipe -l 1000000 -c 4

  # No haltpoll (and, no TIF_POLLING_NRFLAG):

  Performance counter stats for 'CPU(s) 4,5' (5 runs):

         25,229.57 msec task-clock                       #    2.000 CPUs utilized               ( +-  7.75% )
    45,821,250,284      cycles                           #    1.816 GHz                         ( +- 10.07% )
    26,557,496,665      instructions                     #    0.58  insn per cycle              ( +-  0.21% )
                 0      sched:sched_wake_idle_without_ipi #    0.000 /sec

       12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )


  # Haltpoll:

  Performance counter stats for 'CPU(s) 4,5' (5 runs):

         15,131.58 msec task-clock                       #    2.000 CPUs utilized               ( +- 10.00% )
    34,158,188,839      cycles                           #    2.257 GHz                         ( +-  6.91% )
    20,824,950,916      instructions                     #    0.61  insn per cycle              ( +-  0.09% )
         1,983,822      sched:sched_wake_idle_without_ipi #  131.105 K/sec                       ( +-  0.78% )

        7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )

  We get improved latency because we don't switch in and out of a
  deeper sleep state or from the hypervisor. This also causes us to
  execute ~20% fewer instructions.


Haris Okanovic also saw improvement in real workloads due to the
cpuidle changes: "observed 4-6% improvements in memcahed, cassandra,
mysql, and postgresql under certain loads. Other applications likely
benefit too." [12]


Changelog:
  v10 [10]:
   - add a comment mentioning that smp_cond_load_relaxed_timeout() might
     be using architectural primitives that don't support MMIO.
     (David Laight, Catalin Marinas)
   - added a kunit test for smp_cond_load_relaxed_timeout() (Andrew
     Morton.)

  v9 [9]:
   - s/@cond/@cond_expr/ (Randy Dunlap)
   - Clarify that SMP_TIMEOUT_POLL_COUNT is only around memory
     addresses. (David Laight)
   - Add the missing config ARCH_HAS_CPU_RELAX in arch/arm64/Kconfig.
     (Catalin Marinas).
   - Switch to arch_counter_get_cntvct_stable() (via __delay_cycles())
     in the cmpwait path instead of using arch_timer_read_counter().
     (Catalin Marinas)

  v8 [0]:
   - Defer evaluation of @time_expr_ns to when we hit the slowpath.
      (comment from Alexei Starovoitov).

   - Mention that cpu_poll_relax() is better than raw CPU polling
     only where ARCH_HAS_CPU_RELAX is defined.
     - also define ARCH_HAS_CPU_RELAX for arm64.
      (Came out of a discussion with Will Deacon.)

   - Split out WFET and WFE handling. I was doing both of these
     in a common handler.
     (From Will Deacon and in an earlier revision by Catalin Marinas.)

   - Add mentions of atomic_cond_read_{relaxed,acquire}(),
     atomic_cond_read_{relaxed,acquire}_timeout() in
     Documentation/atomic_t.txt.

   - Use the BIT() macro to do the checking in tif_bitset_relaxed_wait().

   - Cleanup unnecessary assignments, casts etc in poll_idle().
     (From Rafael Wysocki.)

   - Fixup warnings from kernel build robot


  v7 [1]:
   - change the interface to separately provide the timeout. This is
     useful for supporting WFET and similar primitives which can do
     timed waiting (suggested by Arnd Bergmann).

   - Adapting rqspinlock code to this changed interface also
     necessitated allowing time_expr to fail.
   - rqspinlock changes to adapt to the new smp_cond_load_acquire_timeout().

   - add WFET support (suggested by Arnd Bergmann).
   - add support for atomic-long wrappers.
   - add a new scheduler interface tif_need_resched_relaxed_wait() which
     encapsulates the polling logic used by poll_idle().
     (interface suggested by Rafael J. Wysocki).


  v6 [2]:
   - fixup missing timeout parameters in atomic64_cond_read_*_timeout()
   - remove a race between setting of TIF_NEED_RESCHED and the call to
     smp_cond_load_relaxed_timeout(). This would mean that dev->poll_time_limit
     would be set even if we hadn't spent any time waiting.
     (The original check compared against local_clock(), which would have been
     fine, but I was instead using a cheaper check against _TIF_NEED_RESCHED.)
   (Both from meta-CI bot)


  v5 [3]:
   - use cpu_poll_relax() instead of cpu_relax().
   - instead of defining an arm64 specific
     smp_cond_load_relaxed_timeout(), just define the appropriate
     cpu_poll_relax().
   - re-read the target pointer when we exit due to the time-check.
   - s/SMP_TIMEOUT_SPIN_COUNT/SMP_TIMEOUT_POLL_COUNT/
   (Suggested by Will Deacon)

   - add atomic_cond_read_*_timeout() and atomic64_cond_read_*_timeout()
     interfaces.
   - rqspinlock: use atomic_cond_read_acquire_timeout().
   - cpuidle: use smp_cond_load_relaxed_timeout() for polling.
   (Suggested by Catalin Marinas)

   - rqspinlock: define SMP_TIMEOUT_POLL_COUNT to be 16k for non arm64


  v4 [4]:
    - naming change 's/timewait/timeout/'
    - resilient spinlocks: get rid of res_smp_cond_load_acquire_waiting()
      and fixup use of RES_CHECK_TIMEOUT().
    (Both suggested by Catalin Marinas)

  v3 [5]:
    - further interface simplifications (suggested by Catalin Marinas)

  v2 [6]:
    - simplified the interface (suggested by Catalin Marinas)
       - get rid of wait_policy, and a multitude of constants
       - adds a slack parameter
      This helped remove a fair amount of duplicated code and, in
      hindsight, unnecessary constants.

  v1 [7]:
     - add wait_policy (coarse and fine)
     - derive spin-count etc at runtime instead of using arbitrary
       constants.

Haris Okanovic tested v4 of this series with poll_idle()/haltpoll patches. [8]

Comments appreciated!

Thanks
Ankur

 [0] https://lore.kernel.org/lkml/20251215044919.460086-1-ankur.a.arora@oracle.com/
 [1] https://lore.kernel.org/lkml/20251028053136.692462-1-ankur.a.arora@oracle.com/
 [2] https://lore.kernel.org/lkml/20250911034655.3916002-1-ankur.a.arora@oracle.com/
 [3] https://lore.kernel.org/lkml/20250911034655.3916002-1-ankur.a.arora@oracle.com/
 [4] https://lore.kernel.org/lkml/20250829080735.3598416-1-ankur.a.arora@oracle.com/
 [5] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
 [6] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
 [7] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
 [8] https://lore.kernel.org/lkml/2cecbf7fb23ee83a4ce027e1be3f46f97efd585c.camel@amazon.com/
 [9] https://lore.kernel.org/lkml/20260209023153.2661784-1-ankur.a.arora@oracle.com/
 [10] https://lore.kernel.org/lkml/20260316013651.3225328-1-ankur.a.arora@oracle.com/
 [11] https://lore.kernel.org/lkml/20230809134837.GM212435@hirez.programming.kicks-ass.net/
 [12] https://lore.kernel.org/lkml/c6f3c8d3f1f2e89a9dc7ae22482973b5a51b08cb.camel@amazon.com/

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: bpf@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-pm@vger.kernel.org

Ankur Arora (14):
  asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
  arm64: barrier: Support smp_cond_load_relaxed_timeout()
  arm64/delay: move some constants out to a separate header
  arm64: support WFET in smp_cond_load_relaxed_timeout()
  arm64: rqspinlock: Remove private copy of
    smp_cond_load_acquire_timewait()
  asm-generic: barrier: Add smp_cond_load_acquire_timeout()
  atomic: Add atomic_cond_read_*_timeout()
  locking/atomic: scripts: build atomic_long_cond_read_*_timeout()
  bpf/rqspinlock: switch check_timeout() to a clock interface
  bpf/rqspinlock: Use smp_cond_load_acquire_timeout()
  sched: add need-resched timed wait interface
  cpuidle/poll_state: Wait for need-resched via
    tif_need_resched_relaxed_wait()
  kunit: enable testing smp_cond_load_relaxed_timeout()
  kunit: add tests for smp_cond_load_relaxed_timeout()

 Documentation/atomic_t.txt           |  14 +--
 arch/arm64/Kconfig                   |   3 +
 arch/arm64/include/asm/barrier.h     |  23 +++++
 arch/arm64/include/asm/cmpxchg.h     |  62 ++++++++++---
 arch/arm64/include/asm/delay-const.h |  27 ++++++
 arch/arm64/include/asm/rqspinlock.h  |  85 ------------------
 arch/arm64/lib/delay.c               |  17 ++--
 drivers/clocksource/arm_arch_timer.c |   2 +
 drivers/cpuidle/poll_state.c         |  21 +----
 drivers/soc/qcom/rpmh-rsc.c          |   8 +-
 include/asm-generic/barrier.h        |  95 ++++++++++++++++++++
 include/linux/atomic.h               |  10 +++
 include/linux/atomic/atomic-long.h   |  18 ++--
 include/linux/sched/idle.h           |  29 +++++++
 kernel/bpf/rqspinlock.c              |  77 +++++++++++------
 lib/Kconfig.debug                    |  10 +++
 lib/tests/Makefile                   |   1 +
 lib/tests/barrier-timeout-test.c     | 125 +++++++++++++++++++++++++++
 scripts/atomic/gen-atomic-long.sh    |  16 ++--
 19 files changed, 465 insertions(+), 178 deletions(-)
 create mode 100644 arch/arm64/include/asm/delay-const.h
 create mode 100644 lib/tests/barrier-timeout-test.c

-- 
2.31.1

