Linux Power Management development


* Re: [PATCH] interconnect: imx: fix use-after-free in imx_icc_node_init_qos()
From: Markus Elfring @ 2026-04-08  7:28 UTC
  To: vulab, imx, linux-pm, linux-arm-kernel, Georgi Djakov,
	Sascha Hauer, Shawn Guo
  Cc: kernel, stable, LKML, Fabio Estevam
In-Reply-To: <20260408031004.309483-1-vulab@iscas.ac.cn>

> Move of_node_put(dn) after the last use of dn, and add a missing put
> in the error path to avoid both use-after-free and reference leak.

What do you think about making greater use of scope-based resource management here?
https://elixir.bootlin.com/linux/v7.0-rc7/source/include/linux/of.h#L138
https://elixir.bootlin.com/linux/v7.0-rc7/source/drivers/interconnect/imx/imx.c#L117-L160

Regards,
Markus
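
For readers unfamiliar with the suggestion above: in the kernel, scope-based
cleanup is spelled with helpers such as __free(device_node), which are built
on the compiler's cleanup attribute. The following is a minimal userspace
sketch of that idea, not the imx driver code. The names struct node,
node_get(), node_put(), auto_put_node and init_qos() are hypothetical
stand-ins, and the sketch assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted struct device_node. */
struct node {
	int refcount;
};

static struct node *node_get(struct node *n)
{
	if (n)
		n->refcount++;
	return n;
}

static void node_put(struct node *n)
{
	if (!n)
		return;
	assert(n->refcount > 0);
	n->refcount--;
}

/* The cleanup attribute passes a pointer to the annotated variable. */
static void node_put_cleanup(struct node **np)
{
	node_put(*np);
}

/* Hypothetical macro playing the role of the kernel's __free(device_node). */
#define auto_put_node __attribute__((cleanup(node_put_cleanup)))

/*
 * Takes a reference for the duration of the call. Every return path,
 * including the early error return, drops it exactly once.
 */
static int init_qos(struct node *parent, int simulate_error)
{
	struct node *dn auto_put_node = node_get(parent);

	if (!dn)
		return -1;
	if (simulate_error)
		return -2;	/* no explicit node_put() needed here */

	return dn->refcount;	/* evaluated before the automatic put runs */
}
```

Because the put runs when dn goes out of scope, it can neither be issued
before the last use of dn nor be forgotten on an error path, which are the
two problems the quoted patch description mentions.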


* Re: [PATCH v6 01/27] Revert "treewide: Fix probing of devices in DT overlays"
From: Geert Uytterhoeven @ 2026-04-08  8:03 UTC
  To: Herve Codina
  Cc: Andrew Lunn, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Kalle Niemi, Matti Vaittinen, Greg Kroah-Hartman,
	Rafael J. Wysocki, Danilo Krummrich, Frank Li, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, Michael Turquette,
	Stephen Boyd, Andi Shyti, Wolfram Sang, Peter Rosin,
	Arnd Bergmann, Saravana Kannan, Bjorn Helgaas, Charles Keepax,
	Richard Fitzgerald, David Rhodes, Linus Walleij, Ulf Hansson,
	Mark Brown, Len Brown, Andy Shevchenko, Daniel Scally,
	Heikki Krogerus, Sakari Ailus, Davidlohr Bueso, Jonathan Cameron,
	Dave Jiang, Alison Schofield, Vishal Verma, Ira Weiny,
	Dan Williams, Shawn Guo, Wolfram Sang, linux-kernel, driver-core,
	imx, linux-arm-kernel, linux-clk, linux-i2c, devicetree,
	linux-pci, linux-sound, patches, linux-gpio, linux-pm, linux-spi,
	linux-acpi, linux-cxl, Allan Nielsen, Horatiu Vultur,
	Steen Hegelund, Luca Ceresoli, Thomas Petazzoni, Saravana Kannan
In-Reply-To: <20260325143555.451852-2-herve.codina@bootlin.com>

On Wed, 25 Mar 2026 at 15:36, Herve Codina <herve.codina@bootlin.com> wrote:
> From: Saravana Kannan <saravanak@google.com>
>
> This reverts commit 1a50d9403fb90cbe4dea0ec9fd0351d2ecbd8924.
>
> While the commit fixed fw_devlink overlay handling for one case, it
> broke it for another case. So revert it and redo the fix in a separate
> patch.
>
> Fixes: 1a50d9403fb9 ("treewide: Fix probing of devices in DT overlays")
> Reported-by: Herve Codina <herve.codina@bootlin.com>
> Closes: https://lore.kernel.org/lkml/CAMuHMdXEnSD4rRJ-o90x4OprUacN_rJgyo8x6=9F9rZ+-KzjOg@mail.gmail.com/
> Closes: https://lore.kernel.org/all/20240221095137.616d2aaa@bootlin.com/
> Closes: https://lore.kernel.org/lkml/20240312151835.29ef62a0@bootlin.com/
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> Link: https://lore.kernel.org/lkml/20240411235623.1260061-2-saravanak@google.com/
> Signed-off-by: Herve Codina <herve.codina@bootlin.com>
> Acked-by: Mark Brown <broonie@kernel.org>

> --- a/drivers/bus/imx-weim.c
> +++ b/drivers/bus/imx-weim.c
> @@ -327,12 +327,6 @@ static int of_weim_notify(struct notifier_block *nb, unsigned long action,
>                                  "Failed to setup timing for '%pOF'\n", rd->dn);
>
>                 if (!of_node_check_flag(rd->dn, OF_POPULATED)) {
> -                       /*
> -                        * Clear the flag before adding the device so that
> -                        * fw_devlink doesn't skip adding consumers to this
> -                        * device.
> -                        */
> -                       rd->dn->fwnode.flags &= ~FWNODE_FLAG_NOT_DEVICE;
>                         if (!of_platform_device_create(rd->dn, NULL, &pdev->dev)) {
>                                 dev_err(&pdev->dev,
>                                         "Failed to create child device '%pOF'\n",

Note that all these removals no longer apply cleanly due to commit
f72e77c33e4b5657 ("device property: Make modifications of fwnode
"flags" thread safe") in driver-core-next, which is gonna complicate
backporting to stable.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* Re: [PATCH v3 1/2] power: reset: Add QEMU virt-ctrl driver
From: Geert Uytterhoeven @ 2026-04-08  9:10 UTC
  To: Sebastian Reichel
  Cc: Kuan-Wei Chiu, jserv, eleanor15x, daniel, laurent, linux-kernel,
	linux-m68k, linux-pm
In-Reply-To: <ac7kP64nbhZzwbJV@venus>

On Thu, 2 Apr 2026 at 23:52, Sebastian Reichel
<sebastian.reichel@collabora.com> wrote:
> On Sun, Feb 22, 2026 at 05:32:24PM +0000, Kuan-Wei Chiu wrote:
> > Add a new driver for the 'virt-ctrl' device found on QEMU virt machines
> > (e.g. m68k). This device provides a simple interface for system reset
> > and power off [1].
> >
> > This driver utilizes the modern system-off API to register callbacks
> > for both system restart and power off. It also registers a reboot
> > notifier to catch SYS_HALT events, ensuring that LINUX_REBOOT_CMD_HALT
> > is properly handled. It is designed to be generic and can be reused by
> > other architectures utilizing this QEMU device.
> >
> > Link: https://gitlab.com/qemu-project/qemu/-/blob/v10.2.0/hw/misc/virt_ctrl.c [1]
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > ---
>
> I think this should be merged with the second patch via the m68k
> tree:
>
> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>

Thanks, will queue in the m68k tree for v7.2.

Gr{oetje,eeting}s,

                        Geert
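
As a rough illustration of what the quoted patch description covers, the
sketch below maps restart, power-off and halt events onto command writes to a
fake register window. It is a userspace mock, not the driver from the patch.
The register offsets and command codes are assumptions modelled on the QEMU
source linked in the patch description and should be checked against it. All
identifiers (fake_virt_ctrl, handle_event(), the EV_* and VIRT_CTRL_* names)
are hypothetical, as is the choice to treat power-off and halt alike.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed register layout and command codes; verify against the QEMU source. */
enum { VIRT_CTRL_REG_FEATURES = 0x00, VIRT_CTRL_REG_CMD = 0x04 };
enum {
	VIRT_CTRL_CMD_NOOP,
	VIRT_CTRL_CMD_RESET,
	VIRT_CTRL_CMD_HALT,
	VIRT_CTRL_CMD_PANIC,
};

/* Hypothetical event kinds the driver is described as handling. */
enum sys_event { EV_RESTART, EV_POWER_OFF, EV_HALT };

/* Fake MMIO window standing in for the mapped device registers. */
struct fake_virt_ctrl {
	uint32_t regs[2];
};

static void fake_writel(struct fake_virt_ctrl *dev, uint32_t off, uint32_t val)
{
	assert(off % 4 == 0 && off / 4 < 2);
	dev->regs[off / 4] = val;
}

/* Assumption: power-off and halt both map to the HALT command. */
static uint32_t event_to_cmd(enum sys_event ev)
{
	switch (ev) {
	case EV_RESTART:
		return VIRT_CTRL_CMD_RESET;
	case EV_POWER_OFF:
	case EV_HALT:
		return VIRT_CTRL_CMD_HALT;
	}
	return VIRT_CTRL_CMD_NOOP;
}

/* One entry point for all three paths: callback or notifier ends up here. */
static void handle_event(struct fake_virt_ctrl *dev, enum sys_event ev)
{
	fake_writel(dev, VIRT_CTRL_REG_CMD, event_to_cmd(ev));
}
```

In the real driver the three paths are registered separately (restart and
power-off callbacks plus a reboot notifier for SYS_HALT), as the patch
description states; the mock only shows that they converge on one register
write.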



* Re: [PATCH v3 1/2] power: reset: Add QEMU virt-ctrl driver
From: Geert Uytterhoeven @ 2026-04-08  9:10 UTC
  To: Sebastian Reichel
  Cc: Kuan-Wei Chiu, jserv, eleanor15x, daniel, laurent, linux-kernel,
	linux-m68k, linux-pm
In-Reply-To: <CAMuHMdUSTdrv9Dswez-QrcFTPwnnFy7m5gc-ScMwBfQLDayQZQ@mail.gmail.com>

On Wed, 8 Apr 2026 at 11:10, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> On Thu, 2 Apr 2026 at 23:52, Sebastian Reichel
> <sebastian.reichel@collabora.com> wrote:
> > On Sun, Feb 22, 2026 at 05:32:24PM +0000, Kuan-Wei Chiu wrote:
> > > Add a new driver for the 'virt-ctrl' device found on QEMU virt machines
> > > (e.g. m68k). This device provides a simple interface for system reset
> > > and power off [1].
> > >
> > > This driver utilizes the modern system-off API to register callbacks
> > > for both system restart and power off. It also registers a reboot
> > > notifier to catch SYS_HALT events, ensuring that LINUX_REBOOT_CMD_HALT
> > > is properly handled. It is designed to be generic and can be reused by
> > > other architectures utilizing this QEMU device.
> > >
> > > Link: https://gitlab.com/qemu-project/qemu/-/blob/v10.2.0/hw/misc/virt_ctrl.c [1]
> > > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > > ---
> >
> > I think this should be merged with the second patch via the m68k
> > tree:
> >
> > Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
>
> Thanks, will queue in the m68k tree for v7.2.

Oops, v7.1.

Gr{oetje,eeting}s,

                        Geert



* Re: [PATCH v3 2/2] m68k: virt: Switch to qemu-virt-ctrl driver
From: Geert Uytterhoeven @ 2026-04-08  9:12 UTC
  To: Kuan-Wei Chiu
  Cc: sre, jserv, eleanor15x, daniel, laurent, linux-kernel, linux-m68k,
	linux-pm
In-Reply-To: <20260222173225.1105572-3-visitorckw@gmail.com>

On Sun, 22 Feb 2026 at 18:32, Kuan-Wei Chiu <visitorckw@gmail.com> wrote:
> Register the "qemu-virt-ctrl" platform device during board
> initialization to utilize the new generic power/reset driver.
>
> Consequently, remove the legacy reset and power-off implementations
> specific to the virt machine. The platform's mach_reset callback is
> updated to call do_kernel_restart(), bridging the legacy m68k reboot
> path to the generic kernel restart handler framework for this machine.
>
> To prevent any regressions in reboot or power-off functionality when
> the driver is not built-in, explicitly select POWER_RESET and
> POWER_RESET_QEMU_VIRT_CTRL for the VIRT machine in Kconfig.machine.
>
> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> ---
> Changes in v3:
> - Add 'select POWER_RESET' and 'select POWER_RESET_QEMU_VIRT_CTRL' in
>   Kconfig.machine to avoid restart/power-off regressions.

Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
i.e. will queue in the m68k tree for v7.1.

Gr{oetje,eeting}s,

                        Geert



* [rafael-pm:bleeding-edge 170/268] drivers/hwmon/emc2305.c:312:undefined reference to `devm_thermal_of_cooling_device_register'
From: kernel test robot @ 2026-04-08  9:43 UTC
  To: Daniel Lezcano; +Cc: oe-kbuild-all, linux-acpi, linux-pm, Rafael J. Wysocki

Hi Daniel,

FYI, the error/warning was bisected to this commit, please ignore it if it's irrelevant.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git bleeding-edge
head:   d18364264af84e2a89da14c6b5f0eae2ba7f98de
commit: e1b96fba58c6fe18a31a06f752ebc8ad6921b1cb [170/268] thermal/of: Move OF code where it belongs to
config: x86_64-randconfig-074-20260408 (https://download.01.org/0day-ci/archive/20260408/202604081734.3OJSeExW-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260408/202604081734.3OJSeExW-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202604081734.3OJSeExW-lkp@intel.com/

All errors (new ones prefixed by >>):

   ld: vmlinux.o: in function `emc2305_set_single_tz':
>> drivers/hwmon/emc2305.c:312:(.text+0x311f7a1): undefined reference to `devm_thermal_of_cooling_device_register'
   ld: vmlinux.o: in function `max6650_probe':
>> drivers/hwmon/max6650.c:796:(.text+0x318d88a): undefined reference to `devm_thermal_of_cooling_device_register'
   ld: vmlinux.o: in function `tc654_probe':
>> drivers/hwmon/tc654.c:544:(.text+0x3193384): undefined reference to `devm_thermal_of_cooling_device_register'


vim +312 drivers/hwmon/emc2305.c

0d8400c5a2ce159 Michael Shych   2022-08-10  301  
2ed4db7a1d07b34 Florin Leotescu 2025-06-03  302  static int emc2305_set_single_tz(struct device *dev, struct device_node *fan_node, int idx)
0d8400c5a2ce159 Michael Shych   2022-08-10  303  {
0d8400c5a2ce159 Michael Shych   2022-08-10  304  	struct emc2305_data *data = dev_get_drvdata(dev);
0d8400c5a2ce159 Michael Shych   2022-08-10  305  	long pwm;
0d8400c5a2ce159 Michael Shych   2022-08-10  306  	int i, cdev_idx, ret;
0d8400c5a2ce159 Michael Shych   2022-08-10  307  
0d8400c5a2ce159 Michael Shych   2022-08-10  308  	cdev_idx = (idx) ? idx - 1 : 0;
0d8400c5a2ce159 Michael Shych   2022-08-10  309  	pwm = data->pwm_min[cdev_idx];
0d8400c5a2ce159 Michael Shych   2022-08-10  310  
0d8400c5a2ce159 Michael Shych   2022-08-10  311  	data->cdev_data[cdev_idx].cdev =
2ed4db7a1d07b34 Florin Leotescu 2025-06-03 @312  		devm_thermal_of_cooling_device_register(dev, fan_node,
2115cbeec8a3ccc Florin Leotescu 2025-03-21  313  							emc2305_fan_name[idx], data,
0d8400c5a2ce159 Michael Shych   2022-08-10  314  							&emc2305_cooling_ops);
0d8400c5a2ce159 Michael Shych   2022-08-10  315  
0d8400c5a2ce159 Michael Shych   2022-08-10  316  	if (IS_ERR(data->cdev_data[cdev_idx].cdev)) {
0d8400c5a2ce159 Michael Shych   2022-08-10  317  		dev_err(dev, "Failed to register cooling device %s\n", emc2305_fan_name[idx]);
0d8400c5a2ce159 Michael Shych   2022-08-10  318  		return PTR_ERR(data->cdev_data[cdev_idx].cdev);
0d8400c5a2ce159 Michael Shych   2022-08-10  319  	}
0429415a084a154 Florin Leotescu 2025-06-03  320  
0429415a084a154 Florin Leotescu 2025-06-03  321  	if (data->cdev_data[cdev_idx].cur_state > 0)
0429415a084a154 Florin Leotescu 2025-06-03  322  		/* Update pwm when temperature is above trips */
0429415a084a154 Florin Leotescu 2025-06-03  323  		pwm = EMC2305_PWM_STATE2DUTY(data->cdev_data[cdev_idx].cur_state,
0429415a084a154 Florin Leotescu 2025-06-03  324  					     data->max_state, EMC2305_FAN_MAX);
0429415a084a154 Florin Leotescu 2025-06-03  325  
0d8400c5a2ce159 Michael Shych   2022-08-10  326  	/* Set minimal PWM speed. */
0d8400c5a2ce159 Michael Shych   2022-08-10  327  	if (data->pwm_separate) {
0d8400c5a2ce159 Michael Shych   2022-08-10  328  		ret = emc2305_set_pwm(dev, pwm, cdev_idx);
0d8400c5a2ce159 Michael Shych   2022-08-10  329  		if (ret < 0)
0d8400c5a2ce159 Michael Shych   2022-08-10  330  			return ret;
0d8400c5a2ce159 Michael Shych   2022-08-10  331  	} else {
0d8400c5a2ce159 Michael Shych   2022-08-10  332  		for (i = 0; i < data->pwm_num; i++) {
0d8400c5a2ce159 Michael Shych   2022-08-10  333  			ret = emc2305_set_pwm(dev, pwm, i);
0d8400c5a2ce159 Michael Shych   2022-08-10  334  			if (ret < 0)
0d8400c5a2ce159 Michael Shych   2022-08-10  335  				return ret;
0d8400c5a2ce159 Michael Shych   2022-08-10  336  		}
0d8400c5a2ce159 Michael Shych   2022-08-10  337  	}
0d8400c5a2ce159 Michael Shych   2022-08-10  338  	data->cdev_data[cdev_idx].cur_state =
0429415a084a154 Florin Leotescu 2025-06-03  339  		EMC2305_PWM_DUTY2STATE(pwm, data->max_state,
0d8400c5a2ce159 Michael Shych   2022-08-10  340  				       EMC2305_FAN_MAX);
0d8400c5a2ce159 Michael Shych   2022-08-10  341  	data->cdev_data[cdev_idx].last_hwmon_state =
0429415a084a154 Florin Leotescu 2025-06-03  342  		EMC2305_PWM_DUTY2STATE(pwm, data->max_state,
0d8400c5a2ce159 Michael Shych   2022-08-10  343  				       EMC2305_FAN_MAX);
0d8400c5a2ce159 Michael Shych   2022-08-10  344  	return 0;
0d8400c5a2ce159 Michael Shych   2022-08-10  345  }
0d8400c5a2ce159 Michael Shych   2022-08-10  346  

:::::: The code at line 312 was first introduced by commit
:::::: 2ed4db7a1d07b349b50e890dee3d0f245230d254 hwmon: (emc2305) Configure PWM channels based on DT properties

:::::: TO: Florin Leotescu <florin.leotescu@nxp.com>
:::::: CC: Guenter Roeck <linux@roeck-us.net>

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] pmdomain: qcom: cpr: add COMPILE_TEST support
From: Ulf Hansson @ 2026-04-08 10:02 UTC
  To: Rosen Penev; +Cc: linux-pm, open list:ARM/QUALCOMM MAILING LIST, open list
In-Reply-To: <20260402025406.94272-1-rosenp@gmail.com>

On Thu, 2 Apr 2026 at 04:54, Rosen Penev <rosenp@gmail.com> wrote:
>
> Allow the buildbots to build the driver on other platforms. There is
> nothing arch-specific going on here.
>
> Signed-off-by: Rosen Penev <rosenp@gmail.com>

Applied for next, thanks!

Kind regards
Uffe


> ---
>  drivers/pmdomain/qcom/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pmdomain/qcom/Kconfig b/drivers/pmdomain/qcom/Kconfig
> index 3d3948eabef0..72cbcfe7a0c9 100644
> --- a/drivers/pmdomain/qcom/Kconfig
> +++ b/drivers/pmdomain/qcom/Kconfig
> @@ -3,7 +3,7 @@ menu "Qualcomm PM Domains"
>
>  config QCOM_CPR
>         tristate "QCOM Core Power Reduction (CPR) support"
> -       depends on ARCH_QCOM && HAS_IOMEM
> +       depends on (ARCH_QCOM || COMPILE_TEST) && HAS_IOMEM
>         select PM_OPP
>         select REGMAP
>         help
> --
> 2.53.0
>


* Re: [PATCH v2 0/2] power: qcom,rpmpd: add RPMh power domains support for Hawi SoC
From: Ulf Hansson @ 2026-04-08 10:02 UTC
  To: Fenglin Wu
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Konrad Dybcio, Subbaraman Narayanamurthy, linux-arm-msm,
	devicetree, linux-kernel, linux-pm, kernel, Taniya Das
In-Reply-To: <20260402-haw-rpmhpd-v2-0-2bce0767f2ca@oss.qualcomm.com>

On Fri, 3 Apr 2026 at 02:36, Fenglin Wu <fenglin.wu@oss.qualcomm.com> wrote:
>
> Add constant definitions for the new power domains and new voltage
> levels present in Hawi SoC. Also add RPMH power domain support for
> Hawi SoC.
>
> Signed-off-by: Fenglin Wu <fenglin.wu@oss.qualcomm.com>

Series applied for next, thanks!

Note, patch 1 is also available on the immutable dt branch.

Kind regards
Uffe


> ---
> Changes in v2:
> - Squash patch 1 and 2 into a single binding change
> - Add trailers for the new patch 2
> - Link to v1: https://patch.msgid.link/20260401-haw-rpmhpd-v1-0-c830c79ed8f9@oss.qualcomm.com
>
> ---
> Fenglin Wu (2):
>       dt-bindings: power: qcom,rpmhpd: Add RPMh power domain for Hawi SoC
>       pmdomain: qcom: rpmhpd: Add power domains for Hawi SoC
>
>  .../devicetree/bindings/power/qcom,rpmpd.yaml      |  1 +
>  drivers/pmdomain/qcom/rpmhpd.c                     | 38 ++++++++++++++++++++++
>  include/dt-bindings/power/qcom,rpmhpd.h            | 12 +++++++
>  3 files changed, 51 insertions(+)
> ---
> base-commit: 33b1a2ee3a3df63e7a08e51e6de2b2d28ddf257f
> change-id: 20260401-haw-rpmhpd-b40a68a3ce79
>
> Best regards,
> --
> Fenglin Wu <fenglin.wu@oss.qualcomm.com>
>


* Re: [PATCH] cpufreq: fix race between hotplug and suspend
From: Rafael J. Wysocki @ 2026-04-08 10:27 UTC
  To: Tianxiang Chen; +Cc: rafael, viresh.kumar, lingyue, linux-pm, linux-kernel
In-Reply-To: <20260408014640.174420-1-nanmu@xiaomi.com>

On Wed, Apr 8, 2026 at 3:46 AM Tianxiang Chen <nanmu@xiaomi.com> wrote:
>
> On Tue, 7 Apr 2026, Rafael J. Wysocki wrote:
> > So how exactly would CPU hotplug be started during a system suspend or resume?
>
> Hi Rafael,
>
> Thank you for your question. Let me explain the two scenarios:
>
> 1. cpufreq_suspend() During Reboot (Confirmed Issue)

Which needs to be mentioned in the patch changelog.

> The real and reproducible race I encountered occurs during system reboot.
>
> Call chain:
>   kernel_restart() -> kernel_restart_prepare()
>             -> device_shutdown() -> cpufreq_suspend()
>
> Different from the regular suspend path, the reboot path does NOT call
> freeze_processes() at all.

That's correct.

> All userspace processes, drivers and kernel threads are
> still running when cpufreq_suspend() executes. This allows CPU hotplug
> (offline/online) operations to run concurrently with cpufreq_suspend().
>
> 2. System suspend/resume (Less Likely but Possible)
>
> CPU hotplug is less likely during system suspend/resume. However,
> non-freezable kernel threads may keep running throughout the entire
> process, which may still trigger CPU hotplug in theory.

Which would be a bug in the kernel thread in question.  So not really.

> So I added cpus_read_lock()/cpus_read_unlock() to block CPU hotplug
> while resume is in progress.

Please resend the patch with a changelog actually mentioning the
failure that you have observed.

Thanks!


* Re: [PATCH] thermal/of: Move OF code where it belongs to
From: Rafael J. Wysocki @ 2026-04-08 10:29 UTC
  To: Rafael J. Wysocki
  Cc: Daniel Lezcano, Daniel Lezcano, Zhang Rui, Lukasz Luba,
	open list:THERMAL, open list
In-Reply-To: <CAJZ5v0hQyyQxGi3Zv9_asMhbKG-eki+4D=mzXBzwuf5x-AZQeQ@mail.gmail.com>

On Tue, Apr 7, 2026 at 7:10 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Apr 7, 2026 at 5:51 PM Daniel Lezcano <daniel.lezcano@kernel.org> wrote:
> >
> > From: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
> >
> > The functions:
> >  - thermal_of_cooling_device_register()
> >  - devm_thermal_of_cooling_device_register()
> >
> >  are related to thermal-of but they are implemented in
> >  thermal-core. Move these functions to the right file.
> >
> > Pure move patch.
> >
> > No functional change intended.
> >
> > Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
> > Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
> > ---
> >  drivers/thermal/thermal_core.c | 75 +---------------------------------
> >  drivers/thermal/thermal_core.h |  5 +++
> >  drivers/thermal/thermal_of.c   | 72 ++++++++++++++++++++++++++++++++
> >  3 files changed, 78 insertions(+), 74 deletions(-)
> >
> > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> > index b7d706ed7ed9..f0049cff1128 100644
> > --- a/drivers/thermal/thermal_core.c
> > +++ b/drivers/thermal/thermal_core.c
> > @@ -1054,7 +1054,7 @@ static void thermal_cooling_device_init_complete(struct thermal_cooling_device *
> >   * Return: a pointer to the created struct thermal_cooling_device or an
> >   * ERR_PTR. Caller must check return value with IS_ERR*() helpers.
> >   */
> > -static struct thermal_cooling_device *
> > +struct thermal_cooling_device *
> >  __thermal_cooling_device_register(struct device_node *np,
> >                                   const char *type, void *devdata,
> >                                   const struct thermal_cooling_device_ops *ops)
> > @@ -1162,79 +1162,6 @@ thermal_cooling_device_register(const char *type, void *devdata,
> >  }
> >  EXPORT_SYMBOL_GPL(thermal_cooling_device_register);
> >
> > -/**
> > - * thermal_of_cooling_device_register() - register an OF thermal cooling device
> > - * @np:                a pointer to a device tree node.
> > - * @type:      the thermal cooling device type.
> > - * @devdata:   device private data.
> > - * @ops:               standard thermal cooling devices callbacks.
> > - *
> > - * This function will register a cooling device with device tree node reference.
> > - * This interface function adds a new thermal cooling device (fan/processor/...)
> > - * to /sys/class/thermal/ folder as cooling_device[0-*]. It tries to bind itself
> > - * to all the thermal zone devices registered at the same time.
> > - *
> > - * Return: a pointer to the created struct thermal_cooling_device or an
> > - * ERR_PTR. Caller must check return value with IS_ERR*() helpers.
> > - */
> > -struct thermal_cooling_device *
> > -thermal_of_cooling_device_register(struct device_node *np,
> > -                                  const char *type, void *devdata,
> > -                                  const struct thermal_cooling_device_ops *ops)
> > -{
> > -       return __thermal_cooling_device_register(np, type, devdata, ops);
> > -}
> > -EXPORT_SYMBOL_GPL(thermal_of_cooling_device_register);
> > -
> > -static void thermal_cooling_device_release(struct device *dev, void *res)
> > -{
> > -       thermal_cooling_device_unregister(
> > -                               *(struct thermal_cooling_device **)res);
> > -}
> > -
> > -/**
> > - * devm_thermal_of_cooling_device_register() - register an OF thermal cooling
> > - *                                            device
> > - * @dev:       a valid struct device pointer of a sensor device.
> > - * @np:                a pointer to a device tree node.
> > - * @type:      the thermal cooling device type.
> > - * @devdata:   device private data.
> > - * @ops:       standard thermal cooling devices callbacks.
> > - *
> > - * This function will register a cooling device with device tree node reference.
> > - * This interface function adds a new thermal cooling device (fan/processor/...)
> > - * to /sys/class/thermal/ folder as cooling_device[0-*]. It tries to bind itself
> > - * to all the thermal zone devices registered at the same time.
> > - *
> > - * Return: a pointer to the created struct thermal_cooling_device or an
> > - * ERR_PTR. Caller must check return value with IS_ERR*() helpers.
> > - */
> > -struct thermal_cooling_device *
> > -devm_thermal_of_cooling_device_register(struct device *dev,
> > -                               struct device_node *np,
> > -                               const char *type, void *devdata,
> > -                               const struct thermal_cooling_device_ops *ops)
> > -{
> > -       struct thermal_cooling_device **ptr, *tcd;
> > -
> > -       ptr = devres_alloc(thermal_cooling_device_release, sizeof(*ptr),
> > -                          GFP_KERNEL);
> > -       if (!ptr)
> > -               return ERR_PTR(-ENOMEM);
> > -
> > -       tcd = __thermal_cooling_device_register(np, type, devdata, ops);
> > -       if (IS_ERR(tcd)) {
> > -               devres_free(ptr);
> > -               return tcd;
> > -       }
> > -
> > -       *ptr = tcd;
> > -       devres_add(dev, ptr);
> > -
> > -       return tcd;
> > -}
> > -EXPORT_SYMBOL_GPL(devm_thermal_of_cooling_device_register);
> > -
> >  static bool thermal_cooling_device_present(struct thermal_cooling_device *cdev)
> >  {
> >         struct thermal_cooling_device *pos = NULL;
> > diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h
> > index d3acff602f9c..bdd59947b24f 100644
> > --- a/drivers/thermal/thermal_core.h
> > +++ b/drivers/thermal/thermal_core.h
> > @@ -269,6 +269,11 @@ void thermal_zone_device_critical_shutdown(struct thermal_zone_device *tz);
> >  void thermal_governor_update_tz(struct thermal_zone_device *tz,
> >                                 enum thermal_notify_event reason);
> >
> > +struct thermal_cooling_device *
> > +__thermal_cooling_device_register(struct device_node *np,
> > +                                 const char *type, void *devdata,
> > +                                 const struct thermal_cooling_device_ops *ops);
> > +
> >  /* Helpers */
> >  #define for_each_trip_desc(__tz, __td) \
> >         for (__td = __tz->trips; __td - __tz->trips < __tz->num_trips; __td++)
> > diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c
> > index 99085c806a1f..398157e740fc 100644
> > --- a/drivers/thermal/thermal_of.c
> > +++ b/drivers/thermal/thermal_of.c
> > @@ -510,3 +510,75 @@ void devm_thermal_of_zone_unregister(struct device *dev, struct thermal_zone_dev
> >                                devm_thermal_of_zone_match, tz));
> >  }
> >  EXPORT_SYMBOL_GPL(devm_thermal_of_zone_unregister);
> > +
> > +/**
> > + * thermal_of_cooling_device_register() - register an OF thermal cooling device
> > + * @np:                a pointer to a device tree node.
> > + * @type:      the thermal cooling device type.
> > + * @devdata:   device private data.
> > + * @ops:               standard thermal cooling devices callbacks.
> > + *
> > + * This function will register a cooling device with device tree node reference.
> > + * This interface function adds a new thermal cooling device (fan/processor/...)
> > + * to /sys/class/thermal/ folder as cooling_device[0-*]. It tries to bind itself
> > + * to all the thermal zone devices registered at the same time.
> > + *
> > + * Return: a pointer to the created struct thermal_cooling_device or an
> > + * ERR_PTR. Caller must check return value with IS_ERR*() helpers.
> > + */
> > +struct thermal_cooling_device *
> > +thermal_of_cooling_device_register(struct device_node *np,
> > +                                  const char *type, void *devdata,
> > +                                  const struct thermal_cooling_device_ops *ops)
> > +{
> > +       return __thermal_cooling_device_register(np, type, devdata, ops);
> > +}
> > +EXPORT_SYMBOL_GPL(thermal_of_cooling_device_register);
> > +
> > +static void thermal_cooling_device_release(struct device *dev, void *res)
> > +{
> > +       thermal_cooling_device_unregister(*(struct thermal_cooling_device **)res);
> > +}
> > +
> > +/**
> > + * devm_thermal_of_cooling_device_register() - register an OF thermal cooling
> > + *                                            device
> > + * @dev:       a valid struct device pointer of a sensor device.
> > + * @np:                a pointer to a device tree node.
> > + * @type:      the thermal cooling device type.
> > + * @devdata:   device private data.
> > + * @ops:       standard thermal cooling devices callbacks.
> > + *
> > + * This function will register a cooling device with device tree node reference.
> > + * This interface function adds a new thermal cooling device (fan/processor/...)
> > + * to /sys/class/thermal/ folder as cooling_device[0-*]. It tries to bind itself
> > + * to all the thermal zone devices registered at the same time.
> > + *
> > + * Return: a pointer to the created struct thermal_cooling_device or an
> > + * ERR_PTR. Caller must check return value with IS_ERR*() helpers.
> > + */
> > +struct thermal_cooling_device *
> > +devm_thermal_of_cooling_device_register(struct device *dev,
> > +                                       struct device_node *np,
> > +                                       const char *type, void *devdata,
> > +                                       const struct thermal_cooling_device_ops *ops)
> > +{
> > +       struct thermal_cooling_device **ptr, *tcd;
> > +
> > +       ptr = devres_alloc(thermal_cooling_device_release, sizeof(*ptr),
> > +                          GFP_KERNEL);
> > +       if (!ptr)
> > +               return ERR_PTR(-ENOMEM);
> > +
> > +       tcd = __thermal_cooling_device_register(np, type, devdata, ops);
> > +       if (IS_ERR(tcd)) {
> > +               devres_free(ptr);
> > +               return tcd;
> > +       }
> > +
> > +       *ptr = tcd;
> > +       devres_add(dev, ptr);
> > +
> > +       return tcd;
> > +}
> > +EXPORT_SYMBOL_GPL(devm_thermal_of_cooling_device_register);
> > --
>
> Applied as 7.1 material, thanks!

And dropped because of a build issue introduced by it:

https://lore.kernel.org/linux-pm/202604081734.3OJSeExW-lkp@intel.com/
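
The devm_ helper moved by the quoted patch follows the usual managed-resource
ordering: allocate the tracking record first, register the resource, free the
record if registration fails, otherwise attach it to the device. The
following userspace miniature sketches that ordering under simplifying
assumptions. It is not the kernel's devres implementation, and struct devres,
struct device, cooling_register() and the other names here are hypothetical
stand-ins.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical miniature of a per-device managed-resource list. */
struct devres {
	struct devres *next;
	void (*release)(void *res);
	void *res;
};

struct device {
	struct devres *head;	/* most recently added entry first */
};

static int registered;		/* number of live fake cooling devices */

static void *cooling_register(int fail)
{
	void *cdev;

	if (fail)
		return NULL;
	cdev = malloc(1);
	if (cdev)
		registered++;
	return cdev;
}

static void cooling_unregister(void *cdev)
{
	assert(registered > 0);
	registered--;
	free(cdev);
}

/* Same step order as the quoted function: alloc, register, add. */
static void *devm_cooling_register(struct device *dev, int fail)
{
	struct devres *dr = malloc(sizeof(*dr));	/* devres_alloc() step */
	void *cdev;

	if (!dr)
		return NULL;

	cdev = cooling_register(fail);
	if (!cdev) {
		free(dr);				/* devres_free() step */
		return NULL;
	}

	dr->release = cooling_unregister;
	dr->res = cdev;
	dr->next = dev->head;				/* devres_add() step */
	dev->head = dr;
	return cdev;
}

/* Runs at "driver detach": releases in reverse order of registration. */
static void device_release_all(struct device *dev)
{
	while (dev->head) {
		struct devres *dr = dev->head;

		dev->head = dr->next;
		dr->release(dr->res);
		free(dr);
	}
}
```

Allocating the record before registering means the only failure after a
successful registration is impossible by construction, so the caller never
has to unwind a half-managed cooling device.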


* [rafael-pm:bleeding-edge 170/268] ERROR: modpost: "thermal_of_cooling_device_register" [drivers/gpu/drm/etnaviv/etnaviv.ko] undefined!
From: kernel test robot @ 2026-04-08 10:58 UTC
  To: Daniel Lezcano; +Cc: oe-kbuild-all, linux-acpi, linux-pm, Rafael J. Wysocki

Hi Daniel,

FYI, the error/warning was bisected to this commit, please ignore it if it's irrelevant.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git bleeding-edge
head:   d18364264af84e2a89da14c6b5f0eae2ba7f98de
commit: e1b96fba58c6fe18a31a06f752ebc8ad6921b1cb [170/268] thermal/of: Move OF code where it belongs to
config: x86_64-randconfig-005-20260408 (https://download.01.org/0day-ci/archive/20260408/202604081848.Yh1jEbFo-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260408/202604081848.Yh1jEbFo-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202604081848.Yh1jEbFo-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "thermal_of_cooling_device_register" [drivers/gpu/drm/etnaviv/etnaviv.ko] undefined!
ERROR: modpost: "devm_thermal_of_cooling_device_register" [drivers/hwmon/dell-smm-hwmon.ko] undefined!

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* [patch V2 00/11] hrtimers: Prevent hrtimer interrupt starvation
From: Thomas Gleixner @ 2026-04-08 11:53 UTC (permalink / raw)
  To: LKML
  Cc: Calvin Owens, Anna-Maria Behnsen, Frederic Weisbecker,
	Peter Zijlstra (Intel), John Stultz, Stephen Boyd, Alexander Viro,
	Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
	linux-pm, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	netfilter-devel, coreteam

This is a follow up to V1 which can be found here:

 https://lore.kernel.org/lkml/20260407083219.478203185@kernel.org

Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
up in user space:

  https://lore.kernel.org/lkml/acMe-QZUel-bBYUh@mozart.vkv.me/

He provided a reproducer, which sets up a timerfd based timer and then
rearms it in a loop with an absolute expiry time of 1ns.

As the expiry time is in the past, the timer ends up as the first expiring
timer in the per CPU hrtimer base and the clockevent device is programmed
with the minimum delta value. If the machine is fast enough, this ends up
in an endless loop of programming the delta value to the minimum value
defined by the clock event device, before the timer interrupt can fire,
which starves the interrupt and consequently triggers the lockup detector
because the hrtimer callback of the lockup mechanism is never invoked.

The first patch in the V1 series changes the clockevent set next event
mechanism to prevent reprogramming of the clockevent device when the
minimum delta value was programmed unless the new delta is larger than
that. It's a less convoluted variant of the patch which was posted in the
above linked thread and was confirmed to prevent the starvation problem.

But that's only to be considered the last resort because it results in an
insane amount of avoidable hrtimer interrupts. That patch has been merged
into the tip tree already.

The problem of user controlled timers is that the input value is only
sanity checked vs. validity of the provided timespec and clamped to be in
the maximum allowable range. But to keep in-kernel usage fast, there is no
check whether a timer which is about to be armed has already expired at
enqueue time.

This series addresses this by providing a separate interface to arm user
controlled timers. This works the same way as the existing
hrtimer_start_range_ns(), but if the timer ends up as the first timer in
the clock base after enqueue, it performs additional checks:

      - Whether the timer becomes the first expiring timer in the CPU base.

        If not the timer is considered to expire in the future as there is
        already an earlier event programmed.

      - Whether the timer has expired already by comparing the expiry value
        against current time.

        If it is expired, the timer is removed from the clock base and the
        function returns false, so that the caller can handle it. That's
        required because the function cannot invoke the callback as that
        might need to acquire a lock which is held by the caller.
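
The two checks above can be modeled in plain user space C as sketched
below. This is a simplified illustration, not the kernel code: the
model_ktime_t type and the model_* function names are made up for this
example, and locking, clock offsets and the dequeue step are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for ktime_t: nanoseconds on a monotonic clock (assumption) */
typedef int64_t model_ktime_t;

/*
 * Model of the two checks. @expires_next is the earliest event which is
 * already programmed for the CPU, @now is the current time.
 *
 * Returns true if the timer can stay queued and false if it is already
 * expired and has to be handled by the caller.
 */
bool model_check_user_timer(model_ktime_t expires, model_ktime_t expires_next,
			    model_ktime_t now)
{
	/* Not the first expiring timer: an earlier event is programmed */
	if (expires >= expires_next)
		return true;

	/* First expiring timer: accept it only if it is in the future */
	return expires > now;
}

/* Usage example with made up time values */
void model_check_user_timer_example(void)
{
	/* Not first: expires_next (400) is earlier than the timer (500) */
	assert(model_check_user_timer(500, 400, 1000));
	/* First and in the future */
	assert(model_check_user_timer(1500, 2000, 1000));
	/* First and already expired: the caller has to handle it */
	assert(!model_check_user_timer(1, 2000, 1000));
}
```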

This function is then used for the user controlled timer arming interfaces
mainly by converting hrtimer sleeper over to it. That affects a few in
kernel users too, but the overhead is minimal in that case and it spares a
tedious whack-a-mole game all over the tree.

The other usage sites in posixtimers, alarmtimers and timerfd are converted
as well, which should cover the vast majority of user space controllable
timers as far as my investigation goes.

Changes vs. V1:

   - Dropped the clockevents patch as it is already merged

   - Rebased on tip timers/core

   - Moved the user check into hrtimer_start_range_ns_user() - Peter

   - Renamed alarmtimer_start() to alarm_start_timer() - Peter

   - Picked up tags as appropriate

The series applies against tip timers/core and is also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git hrtimer-exp-v2

Thanks,

	tglx
---
 drivers/power/supply/charger-manager.c |   12 +-
 fs/timerfd.c                           |  117 ++++++++++++++++-----------
 include/linux/alarmtimer.h             |    9 +-
 include/linux/hrtimer.h                |   20 ++++
 include/trace/events/timer.h           |   13 +++
 kernel/time/alarmtimer.c               |   70 +++++++---------
 kernel/time/hrtimer.c                  |  140 ++++++++++++++++++++++++++++-----
 kernel/time/posix-cpu-timers.c         |   18 ++--
 kernel/time/posix-timers.c             |   35 +++++---
 kernel/time/posix-timers.h             |    4 
 net/netfilter/xt_IDLETIMER.c           |   24 ++++-
 11 files changed, 320 insertions(+), 142 deletions(-)

^ permalink raw reply

* [patch V2 01/11] hrtimer: Provide hrtimer_start_range_ns_user()
From: Thomas Gleixner @ 2026-04-08 11:53 UTC (permalink / raw)
  To: LKML
  Cc: Calvin Owens, Anna-Maria Behnsen, Frederic Weisbecker,
	Peter Zijlstra (Intel), John Stultz, Stephen Boyd, Alexander Viro,
	Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
	linux-pm, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	netfilter-devel, coreteam
In-Reply-To: <20260408102356.783133335@kernel.org>

Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
up in user space. He provided a reproducer, which sets up a timerfd based
timer and then rearms it in a loop with an absolute expiry time of 1ns.

As the expiry time is in the past, the timer ends up as the first expiring
timer in the per CPU hrtimer base and the clockevent device is programmed
with the minimum delta value. If the machine is fast enough, this ends up
in an endless loop of programming the delta value to the minimum value
defined by the clock event device, before the timer interrupt can fire,
which starves the interrupt and consequently triggers the lockup detector
because the hrtimer callback of the lockup mechanism is never invoked.

The clockevents code already has a last resort mechanism to prevent that,
but it's sensible to catch such issues before trying to reprogram the clock
event device.

Provide a variant of hrtimer_start_range_ns(), which sanity checks the
timer after queueing it. It does not do so before because the timer might be
armed and therefore needs to be dequeued. Also, the check is done at the
latest possible point, so that the clock event prevention is avoided as
much as possible.

If the timer is already expired _before_ the clock event is reprogrammed,
remove the timer from the queue and signal to the caller that the operation
failed by returning false.

That allows the caller to take immediate action without going through the
loops and hoops of the hrtimer interrupt.

The queueing code can't invoke the timer callback as the caller might hold
a lock which is taken in the callback.

Add a tracepoint which allows analyzing the expired-at-start situation.

Reported-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
---
V2: Moved the user check into hrtimer_start_range_ns_user() and handled
    the NONE case explicitly. - PeterZ
    Rebased on tip timers/core
---
 include/linux/hrtimer.h      |   20 +++++-
 include/trace/events/timer.h |   13 ++++
 kernel/time/hrtimer.c        |  134 +++++++++++++++++++++++++++++++++++++------
 3 files changed, 148 insertions(+), 19 deletions(-)
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -206,6 +206,9 @@ static inline void destroy_hrtimer_on_st
 extern void hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 				   u64 range_ns, const enum hrtimer_mode mode);
 
+extern bool hrtimer_start_range_ns_user(struct hrtimer *timer, ktime_t tim,
+					u64 range_ns, const enum hrtimer_mode mode);
+
 /**
  * hrtimer_start - (re)start an hrtimer
  * @timer:	the timer to be added
@@ -223,17 +226,28 @@ static inline void hrtimer_start(struct
 extern int hrtimer_cancel(struct hrtimer *timer);
 extern int hrtimer_try_to_cancel(struct hrtimer *timer);
 
-static inline void hrtimer_start_expires(struct hrtimer *timer,
-					 enum hrtimer_mode mode)
+static inline void hrtimer_start_expires(struct hrtimer *timer, enum hrtimer_mode mode)
 {
-	u64 delta;
 	ktime_t soft, hard;
+	u64 delta;
+
 	soft = hrtimer_get_softexpires(timer);
 	hard = hrtimer_get_expires(timer);
 	delta = ktime_to_ns(ktime_sub(hard, soft));
 	hrtimer_start_range_ns(timer, soft, delta, mode);
 }
 
+static inline bool hrtimer_start_expires_user(struct hrtimer *timer, enum hrtimer_mode mode)
+{
+	ktime_t soft, hard;
+	u64 delta;
+
+	soft = hrtimer_get_softexpires(timer);
+	hard = hrtimer_get_expires(timer);
+	delta = ktime_to_ns(ktime_sub(hard, soft));
+	return hrtimer_start_range_ns_user(timer, soft, delta, mode);
+}
+
 void hrtimer_sleeper_start_expires(struct hrtimer_sleeper *sl,
 				   enum hrtimer_mode mode);
 
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -299,6 +299,19 @@ DECLARE_EVENT_CLASS(hrtimer_class,
 );
 
 /**
+ * hrtimer_start_expired - Invoked when an expired timer was started
+ * @hrtimer:	pointer to struct hrtimer
+ *
+ * Preceded by a hrtimer_start tracepoint.
+ */
+DEFINE_EVENT(hrtimer_class, hrtimer_start_expired,
+
+	TP_PROTO(struct hrtimer *hrtimer),
+
+	TP_ARGS(hrtimer)
+);
+
+/**
  * hrtimer_expire_exit - called immediately after the hrtimer callback returns
  * @hrtimer:	pointer to struct hrtimer
  *
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1352,6 +1352,12 @@ static inline bool hrtimer_keep_base(str
 	return hrtimer_prefer_local(is_local, is_first, is_pinned);
 }
 
+enum {
+	HRTIMER_REPROGRAM_NONE,
+	HRTIMER_REPROGRAM,
+	HRTIMER_REPROGRAM_FORCE,
+};
+
 static bool __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim, u64 delta_ns,
 				     const enum hrtimer_mode mode, struct hrtimer_clock_base *base)
 {
@@ -1410,7 +1416,7 @@ static bool __hrtimer_start_range_ns(str
 	/* If a deferred rearm is pending skip reprogramming the device */
 	if (cpu_base->deferred_rearm) {
 		cpu_base->deferred_needs_update = true;
-		return false;
+		return HRTIMER_REPROGRAM_NONE;
 	}
 
 	if (!was_first || cpu_base != this_cpu_base) {
@@ -1423,7 +1429,7 @@ static bool __hrtimer_start_range_ns(str
 		 * callbacks.
 		 */
 		if (likely(hrtimer_base_is_online(this_cpu_base)))
-			return first;
+			return first ? HRTIMER_REPROGRAM : HRTIMER_REPROGRAM_NONE;
 
 		/*
 		 * Timer was enqueued remote because the current base is
@@ -1432,7 +1438,7 @@ static bool __hrtimer_start_range_ns(str
 		 */
 		if (first)
 			smp_call_function_single_async(cpu_base->cpu, &cpu_base->csd);
-		return false;
+		return HRTIMER_REPROGRAM_NONE;
 	}
 
 	/*
@@ -1446,7 +1452,7 @@ static bool __hrtimer_start_range_ns(str
 	 */
 	if (timer->is_lazy) {
 		if (cpu_base->expires_next <= hrtimer_get_expires(timer))
-			return false;
+			return HRTIMER_REPROGRAM_NONE;
 	}
 
 	/*
@@ -1455,8 +1461,24 @@ static bool __hrtimer_start_range_ns(str
 	 * reprogram the hardware by evaluating the new first expiring
 	 * timer.
 	 */
-	hrtimer_force_reprogram(cpu_base, /* skip_equal */ true);
-	return false;
+	return HRTIMER_REPROGRAM_FORCE;
+}
+
+static int hrtimer_start_range_ns_common(struct hrtimer *timer, ktime_t tim,
+					 u64 delta_ns, const enum hrtimer_mode mode,
+					 struct hrtimer_clock_base *base)
+{
+	/*
+	 * Check whether the HRTIMER_MODE_SOFT bit and hrtimer.is_soft
+	 * match on CONFIG_PREEMPT_RT = n. With PREEMPT_RT check the hard
+	 * expiry mode because unmarked timers are moved to softirq expiry.
+	 */
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		WARN_ON_ONCE(!(mode & HRTIMER_MODE_SOFT) ^ !timer->is_soft);
+	else
+		WARN_ON_ONCE(!(mode & HRTIMER_MODE_HARD) ^ !timer->is_hard);
+
+	return __hrtimer_start_range_ns(timer, tim, delta_ns, mode, base);
 }
 
 /**
@@ -1476,24 +1498,104 @@ void hrtimer_start_range_ns(struct hrtim
 
 	debug_hrtimer_assert_init(timer);
 
+	base = lock_hrtimer_base(timer, &flags);
+
+	switch (hrtimer_start_range_ns_common(timer, tim, delta_ns, mode, base)) {
+	case HRTIMER_REPROGRAM:
+		hrtimer_reprogram(timer, true);
+		break;
+	case HRTIMER_REPROGRAM_FORCE:
+		hrtimer_force_reprogram(timer->base->cpu_base, 1);
+		break;
+	case HRTIMER_REPROGRAM_NONE:
+		break;
+	}
+
+	unlock_hrtimer_base(timer, &flags);
+}
+EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);
+
+static inline bool hrtimer_check_user_timer(struct hrtimer *timer)
+{
+	struct hrtimer_cpu_base *cpu_base = timer->base->cpu_base;
+	ktime_t expires;
+
 	/*
-	 * Check whether the HRTIMER_MODE_SOFT bit and hrtimer.is_soft
-	 * match on CONFIG_PREEMPT_RT = n. With PREEMPT_RT check the hard
-	 * expiry mode because unmarked timers are moved to softirq expiry.
+	 * This uses soft expires because that's the user provided
+	 * expiry time, while expires can be further in the future
+	 * due to a slack value added to the user expiry time.
 	 */
-	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
-		WARN_ON_ONCE(!(mode & HRTIMER_MODE_SOFT) ^ !timer->is_soft);
-	else
-		WARN_ON_ONCE(!(mode & HRTIMER_MODE_HARD) ^ !timer->is_hard);
+	expires = hrtimer_get_softexpires(timer);
+
+	/* Convert to monotonic */
+	expires = ktime_sub(expires, timer->base->offset);
+
+	/*
+	 * Check whether this timer will end up as the first expiring timer in
+	 * the CPU base. If not, no further checks required as it's then
+	 * guaranteed to expire in the future.
+	 */
+	if (expires >= cpu_base->expires_next)
+		return true;
+
+	/* Validate that the expiry time is in the future. */
+	if (expires > ktime_get())
+		return true;
+
+	debug_hrtimer_deactivate(timer);
+	__remove_hrtimer(timer, timer->base, HRTIMER_STATE_INACTIVE, false);
+	trace_hrtimer_start_expired(timer);
+	return false;
+}
+
+/**
+ * hrtimer_start_range_ns_user - (re)start a user controlled hrtimer
+ * @timer:	the timer to be added
+ * @tim:	expiry time
+ * @delta_ns:	"slack" range for the timer
+ * @mode:	timer mode: absolute (HRTIMER_MODE_ABS) or
+ *		relative (HRTIMER_MODE_REL), and pinned (HRTIMER_MODE_PINNED);
+ *		softirq based mode is considered for debug purpose only!
+ *
+ * Returns: True when the timer was queued, false if it was already expired
+ *
+ * This function cannot invoke the timer callback for expired timers as it might
+ * be called under a lock which the timer callback needs to acquire. So the
+ * caller has to handle that case.
+ */
+bool hrtimer_start_range_ns_user(struct hrtimer *timer, ktime_t tim,
+				 u64 delta_ns, const enum hrtimer_mode mode)
+{
+	struct hrtimer_clock_base *base;
+	unsigned long flags;
+	bool ret = true;
+
+	debug_hrtimer_assert_init(timer);
 
 	base = lock_hrtimer_base(timer, &flags);
 
-	if (__hrtimer_start_range_ns(timer, tim, delta_ns, mode, base))
-		hrtimer_reprogram(timer, true);
+	switch (hrtimer_start_range_ns_common(timer, tim, delta_ns, mode, base)) {
+	case HRTIMER_REPROGRAM:
+		ret = hrtimer_check_user_timer(timer);
+		if (ret)
+			hrtimer_reprogram(timer, true);
+		break;
+	case HRTIMER_REPROGRAM_FORCE:
+		ret = hrtimer_check_user_timer(timer);
+		/*
+		 * The base must always be reevaluated, independent of the
+		 * result above because the timer was the first pending timer.
+		 */
+		hrtimer_force_reprogram(timer->base->cpu_base, 1);
+		break;
+	case HRTIMER_REPROGRAM_NONE:
+		break;
+	}
 
 	unlock_hrtimer_base(timer, &flags);
+	return ret;
 }
-EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);
+EXPORT_SYMBOL_GPL(hrtimer_start_range_ns_user);
 
 /**
  * hrtimer_try_to_cancel - try to deactivate a timer


^ permalink raw reply

* [patch V2 02/11] hrtimer: Use hrtimer_start_expires_user() for hrtimer sleepers
From: Thomas Gleixner @ 2026-04-08 11:53 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra (Intel), Anna-Maria Behnsen, Frederic Weisbecker,
	Calvin Owens, John Stultz, Stephen Boyd, Alexander Viro,
	Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
	linux-pm, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	netfilter-devel, coreteam
In-Reply-To: <20260408102356.783133335@kernel.org>

Most hrtimer sleepers are user controlled and user space can hand arbitrary
expiry values in as long as they are valid timespecs. If the expiry value
is in the past then this requires a full loop through reprogramming the
clock event device, taking the hrtimer interrupt, waking the task and
reprogramming again.

Use hrtimer_start_expires_user() which avoids the full round trip by
checking the timer for expiry on enqueue.
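
The effect on a sleeper can be illustrated with the user space model
below. It is a sketch under simplifying assumptions: struct model_sleeper
and the model_* helpers are hypothetical, the expiry check is reduced to a
comparison against the current time, and the task state handling is
omitted. It assumes a wait loop which only schedules while the task
pointer is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct hrtimer_sleeper */
struct model_sleeper {
	void	*task;		/* Non-NULL while waiting for the expiry */
	int64_t	expires;	/* Absolute expiry time in nanoseconds */
};

/* Stand-in for hrtimer_start_expires_user(): false if already expired */
static bool model_start_expires_user(const struct model_sleeper *sl, int64_t now)
{
	return sl->expires > now;
}

/* Model of the sleeper start: clear the task pointer on early expiry */
void model_sleeper_start(struct model_sleeper *sl, int64_t now)
{
	assert(sl != NULL);

	if (!model_start_expires_user(sl, now))
		sl->task = NULL;
}

/* A wait loop would only go to sleep while the task pointer is set */
bool model_would_schedule(const struct model_sleeper *sl)
{
	return sl->task != NULL;
}
```

With an expiry time in the past the model never reaches the point where it
would sleep, which corresponds to the round trip through the interrupt
being skipped.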

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>

---
 kernel/time/hrtimer.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2152,7 +2152,11 @@ void hrtimer_sleeper_start_expires(struc
 	if (IS_ENABLED(CONFIG_PREEMPT_RT) && sl->timer.is_hard)
 		mode |= HRTIMER_MODE_HARD;
 
-	hrtimer_start_expires(&sl->timer, mode);
+	/* If already expired, clear the task pointer and set current state to running */
+	if (!hrtimer_start_expires_user(&sl->timer, mode)) {
+		sl->task = NULL;
+		__set_current_state(TASK_RUNNING);
+	}
 }
 EXPORT_SYMBOL_GPL(hrtimer_sleeper_start_expires);
 




^ permalink raw reply

* [patch V2 03/11] posix-timers: Expand timer_[re]arm() callbacks with a boolean return value
From: Thomas Gleixner @ 2026-04-08 11:53 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra (Intel), John Stultz, Stephen Boyd,
	Anna-Maria Behnsen, Frederic Weisbecker, Calvin Owens,
	Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
	Sebastian Reichel, linux-pm, Pablo Neira Ayuso, Florian Westphal,
	Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <20260408102356.783133335@kernel.org>

In order to catch expiry times which are already in the past the
timer_arm() and timer_rearm() callbacks need to be able to report back to
the caller whether the timer has been queued or not.

Change the function signature and let all implementations return true for
now. While at it simplify posix_cpu_timer_rearm().

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <jstultz@google.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>

---
 kernel/time/alarmtimer.c       |    6 ++++--
 kernel/time/posix-cpu-timers.c |   18 ++++++++++--------
 kernel/time/posix-timers.c     |    6 ++++--
 kernel/time/posix-timers.h     |    4 ++--
 4 files changed, 20 insertions(+), 14 deletions(-)
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -527,12 +527,13 @@ static void alarm_handle_timer(struct al
  * alarm_timer_rearm - Posix timer callback for rearming timer
  * @timr:	Pointer to the posixtimer data struct
  */
-static void alarm_timer_rearm(struct k_itimer *timr)
+static bool alarm_timer_rearm(struct k_itimer *timr)
 {
 	struct alarm *alarm = &timr->it.alarm.alarmtimer;
 
 	timr->it_overrun += alarm_forward_now(alarm, timr->it_interval);
 	alarm_start(alarm, alarm->node.expires);
+	return true;
 }
 
 /**
@@ -588,7 +589,7 @@ static void alarm_timer_wait_running(str
  * @absolute:	Expiry value is absolute time
  * @sigev_none:	Posix timer does not deliver signals
  */
-static void alarm_timer_arm(struct k_itimer *timr, ktime_t expires,
+static bool alarm_timer_arm(struct k_itimer *timr, ktime_t expires,
 			    bool absolute, bool sigev_none)
 {
 	struct alarm *alarm = &timr->it.alarm.alarmtimer;
@@ -600,6 +601,7 @@ static void alarm_timer_arm(struct k_iti
 		alarm->node.expires = expires;
 	else
 		alarm_start(&timr->it.alarm.alarmtimer, expires);
+	return true;
 }
 
 /**
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -19,7 +19,7 @@
 
 #include "posix-timers.h"
 
-static void posix_cpu_timer_rearm(struct k_itimer *timer);
+static bool posix_cpu_timer_rearm(struct k_itimer *timer);
 
 void posix_cputimers_group_init(struct posix_cputimers *pct, u64 cpu_limit)
 {
@@ -1011,24 +1011,27 @@ static void check_process_timers(struct
 /*
  * This is called from the signal code (via posixtimer_rearm)
  * when the last timer signal was delivered and we have to reload the timer.
+ *
+ * Return true unconditionally so the core code assumes the timer to be
+ * armed. Otherwise it would requeue the signal.
  */
-static void posix_cpu_timer_rearm(struct k_itimer *timer)
+static bool posix_cpu_timer_rearm(struct k_itimer *timer)
 {
 	clockid_t clkid = CPUCLOCK_WHICH(timer->it_clock);
-	struct task_struct *p;
 	struct sighand_struct *sighand;
+	struct task_struct *p;
 	unsigned long flags;
 	u64 now;
 
-	rcu_read_lock();
+	guard(rcu)();
 	p = cpu_timer_task_rcu(timer);
 	if (!p)
-		goto out;
+		return true;
 
 	/* Protect timer list r/w in arm_timer() */
 	sighand = lock_task_sighand(p, &flags);
 	if (unlikely(sighand == NULL))
-		goto out;
+		return true;
 
 	/*
 	 * Fetch the current sample and update the timer's expiry time.
@@ -1045,8 +1048,7 @@ static void posix_cpu_timer_rearm(struct
 	 */
 	arm_timer(timer, p);
 	unlock_task_sighand(p, &flags);
-out:
-	rcu_read_unlock();
+	return true;
 }
 
 /**
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -288,12 +288,13 @@ static inline int timer_overrun_to_int(s
 	return (int)timr->it_overrun_last;
 }
 
-static void common_hrtimer_rearm(struct k_itimer *timr)
+static bool common_hrtimer_rearm(struct k_itimer *timr)
 {
 	struct hrtimer *timer = &timr->it.real.timer;
 
 	timr->it_overrun += hrtimer_forward_now(timer, timr->it_interval);
 	hrtimer_restart(timer);
+	return true;
 }
 
 static bool __posixtimer_deliver_signal(struct kernel_siginfo *info, struct k_itimer *timr)
@@ -795,7 +796,7 @@ SYSCALL_DEFINE1(timer_getoverrun, timer_
 		return timer_overrun_to_int(scoped_timer);
 }
 
-static void common_hrtimer_arm(struct k_itimer *timr, ktime_t expires,
+static bool common_hrtimer_arm(struct k_itimer *timr, ktime_t expires,
 			       bool absolute, bool sigev_none)
 {
 	struct hrtimer *timer = &timr->it.real.timer;
@@ -822,6 +823,7 @@ static void common_hrtimer_arm(struct k_
 
 	if (!sigev_none)
 		hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
+	return true;
 }
 
 static int common_hrtimer_try_to_cancel(struct k_itimer *timr)
--- a/kernel/time/posix-timers.h
+++ b/kernel/time/posix-timers.h
@@ -27,11 +27,11 @@ struct k_clock {
 	int	(*timer_del)(struct k_itimer *timr);
 	void	(*timer_get)(struct k_itimer *timr,
 			     struct itimerspec64 *cur_setting);
-	void	(*timer_rearm)(struct k_itimer *timr);
+	bool	(*timer_rearm)(struct k_itimer *timr);
 	s64	(*timer_forward)(struct k_itimer *timr, ktime_t now);
 	ktime_t	(*timer_remaining)(struct k_itimer *timr, ktime_t now);
 	int	(*timer_try_to_cancel)(struct k_itimer *timr);
-	void	(*timer_arm)(struct k_itimer *timr, ktime_t expires,
+	bool	(*timer_arm)(struct k_itimer *timr, ktime_t expires,
 			     bool absolute, bool sigev_none);
 	void	(*timer_wait_running)(struct k_itimer *timr);
 };


^ permalink raw reply

* [patch V2 04/11] posix-timers: Handle the timer_[re]arm() return value
From: Thomas Gleixner @ 2026-04-08 11:54 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra (Intel), Anna-Maria Behnsen, Frederic Weisbecker,
	Calvin Owens, John Stultz, Stephen Boyd, Alexander Viro,
	Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
	linux-pm, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	netfilter-devel, coreteam
In-Reply-To: <20260408102356.783133335@kernel.org>

The [re]arm callbacks will return true when the timer was queued and false
if it was already expired at enqueue time.

In both cases the call sites can trivially queue the signal right there,
when the timer was already expired. That avoids a full round trip through
the hrtimer interrupt.
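
The call site handling for the arm case can be modeled in user space as
sketched below. All names (struct model_timer, model_arm_fn, model_arm(),
model_timer_set()) are hypothetical and only mirror the shape of the
change: the callback reports whether the timer was queued and the caller
accounts for a signal right away if it was not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_status { MODEL_TIMER_DISARMED, MODEL_TIMER_ARMED };

/* Hypothetical, reduced stand-in for struct k_itimer */
struct model_timer {
	enum model_status	status;
	unsigned int		signals_queued;
	bool			expired;	/* Knob for the example */
};

/* Callback type: returns true if queued, false if already expired */
typedef bool (*model_arm_fn)(struct model_timer *t);

/* Example callback which reports the state of the @expired knob */
bool model_arm(struct model_timer *t)
{
	return !t->expired;
}

/* Model of the call site: queue the signal directly on early expiry */
void model_timer_set(struct model_timer *t, model_arm_fn arm, bool sigev_none)
{
	assert(t != NULL && arm != NULL);

	if (arm(t)) {
		if (!sigev_none)
			t->status = MODEL_TIMER_ARMED;
	} else {
		/* Timer was already expired, account for the signal */
		t->signals_queued++;
	}
}
```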

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>

---
 kernel/time/posix-timers.c |   22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -299,6 +299,8 @@ static bool common_hrtimer_rearm(struct
 
 static bool __posixtimer_deliver_signal(struct kernel_siginfo *info, struct k_itimer *timr)
 {
+	bool queued;
+
 	guard(spinlock)(&timr->it_lock);
 
 	/*
@@ -312,12 +314,18 @@ static bool __posixtimer_deliver_signal(
 	if (!timr->it_interval || WARN_ON_ONCE(timr->it_status != POSIX_TIMER_REQUEUE_PENDING))
 		return true;
 
-	timr->kclock->timer_rearm(timr);
-	timr->it_status = POSIX_TIMER_ARMED;
+	/* timer_rearm() updates timr::it_overrun */
+	queued = timr->kclock->timer_rearm(timr);
+
 	timr->it_overrun_last = timr->it_overrun;
 	timr->it_overrun = -1LL;
 	++timr->it_signal_seq;
 	info->si_overrun = timer_overrun_to_int(timr);
+
+	if (queued)
+		timr->it_status = POSIX_TIMER_ARMED;
+	else
+		posix_timer_queue_signal(timr);
 	return true;
 }
 
@@ -905,9 +913,13 @@ int common_timer_set(struct k_itimer *ti
 		expires = timens_ktime_to_host(timr->it_clock, expires);
 	sigev_none = timr->it_sigev_notify == SIGEV_NONE;
 
-	kc->timer_arm(timr, expires, flags & TIMER_ABSTIME, sigev_none);
-	if (!sigev_none)
-		timr->it_status = POSIX_TIMER_ARMED;
+	if (kc->timer_arm(timr, expires, flags & TIMER_ABSTIME, sigev_none)) {
+		if (!sigev_none)
+			timr->it_status = POSIX_TIMER_ARMED;
+	} else {
+		/* Timer was already expired, queue the signal */
+		posix_timer_queue_signal(timr);
+	}
 	return 0;
 }
 




^ permalink raw reply

* [patch V2 05/11] posix-timers: Switch to hrtimer_start_expires_user()
From: Thomas Gleixner @ 2026-04-08 11:54 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra (Intel), Anna-Maria Behnsen, Frederic Weisbecker,
	Calvin Owens, John Stultz, Stephen Boyd, Alexander Viro,
	Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
	linux-pm, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	netfilter-devel, coreteam
In-Reply-To: <20260408102356.783133335@kernel.org>

Switch the arm and rearm callbacks for hrtimer based posix timers over to
hrtimer_start_expires_user() so that already expired timers are not
queued. Hand the result back to the caller, which then queues the signal.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>

---
 kernel/time/posix-timers.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -293,8 +293,7 @@ static bool common_hrtimer_rearm(struct
 	struct hrtimer *timer = &timr->it.real.timer;
 
 	timr->it_overrun += hrtimer_forward_now(timer, timr->it_interval);
-	hrtimer_restart(timer);
-	return true;
+	return hrtimer_start_expires_user(timer, HRTIMER_MODE_ABS);
 }
 
 static bool __posixtimer_deliver_signal(struct kernel_siginfo *info, struct k_itimer *timr)
@@ -829,9 +828,11 @@ static bool common_hrtimer_arm(struct k_
 		expires = ktime_add_safe(expires, hrtimer_cb_get_time(timer));
 	hrtimer_set_expires(timer, expires);
 
-	if (!sigev_none)
-		hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
-	return true;
+	/* For sigev_none pretend that the timer is queued */
+	if (sigev_none)
+		return true;
+
+	return hrtimer_start_expires_user(timer, HRTIMER_MODE_ABS);
 }
 
 static int common_hrtimer_try_to_cancel(struct k_itimer *timr)




^ permalink raw reply

* [patch V2 06/11] alarmtimer: Provide alarm_start_timer()
From: Thomas Gleixner @ 2026-04-08 11:54 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Stephen Boyd, Calvin Owens, Anna-Maria Behnsen,
	Frederic Weisbecker, Peter Zijlstra (Intel), Alexander Viro,
	Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
	linux-pm, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	netfilter-devel, coreteam
In-Reply-To: <20260408102356.783133335@kernel.org>

Alarm timers utilize hrtimers for normal operation and only switch to the
RTC on suspend. In order to catch already expired timers early and without
going through a timer interrupt cycle, provide a new start function which
internally uses hrtimer_start_range_ns_user().

If hrtimer_start_range_ns_user() detects an already expired timer, it does
not queue it. In that case remove the timer from the alarm base as well.

Return the queued status to the caller so that it can handle the early
expiry.
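
The enqueue and roll back sequence can be modeled in user space as
sketched below. The model_* types and functions are hypothetical, the
alarm base is reduced to a counter, locking is omitted and the expiry
check is simplified to a comparison against the current time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-ins for struct alarm_base and struct alarm */
struct model_alarm_base {
	unsigned int	queued;		/* Number of enqueued alarms */
};

struct model_alarm {
	int64_t		expires;
	bool		enqueued;
};

/* Stand-in for hrtimer_start_range_ns_user(): false if already expired */
static bool model_hrtimer_start_user(int64_t expires, int64_t now)
{
	return expires > now;
}

/*
 * Model of the start function: enqueue the alarm in the base first and
 * remove it again if the underlying timer turns out to be expired.
 */
bool model_alarm_start_timer(struct model_alarm_base *base,
			     struct model_alarm *alarm,
			     int64_t expires, int64_t now)
{
	assert(base != NULL && alarm != NULL);

	alarm->expires = expires;
	alarm->enqueued = true;
	base->queued++;

	if (!model_hrtimer_start_user(expires, now)) {
		/* Not queued in the timer core, undo the base enqueue */
		alarm->enqueued = false;
		base->queued--;
		return false;
	}
	return true;
}
```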

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Cc: Stephen Boyd <sboyd@kernel.org>
---
V2: Rename to alarm_start_timer() - Peter
---
 include/linux/alarmtimer.h |    6 ++++++
 kernel/time/alarmtimer.c   |   28 ++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)
--- a/include/linux/alarmtimer.h
+++ b/include/linux/alarmtimer.h
@@ -42,8 +42,14 @@ struct alarm {
 	void			*data;
 };
 
+static __always_inline ktime_t alarm_get_expires(struct alarm *alarm)
+{
+	return alarm->node.expires;
+}
+
 void alarm_init(struct alarm *alarm, enum alarmtimer_type type,
 		void (*function)(struct alarm *, ktime_t));
+bool alarm_start_timer(struct alarm *alarm, ktime_t expires, bool relative);
 void alarm_start(struct alarm *alarm, ktime_t start);
 void alarm_start_relative(struct alarm *alarm, ktime_t start);
 void alarm_restart(struct alarm *alarm);
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -365,6 +365,34 @@ void alarm_start_relative(struct alarm *
 }
 EXPORT_SYMBOL_GPL(alarm_start_relative);
 
+/**
+ * alarm_start_timer - Sets an alarm to fire
+ * @alarm:	Pointer to alarm to set
+ * @expires:	Expiry time
+ * @relative:	True if @expires is relative
+ *
+ * Returns: True if the alarm was queued. False if it already expired
+ */
+bool alarm_start_timer(struct alarm *alarm, ktime_t expires, bool relative)
+{
+	struct alarm_base *base = &alarm_bases[alarm->type];
+
+	if (relative)
+		expires = ktime_add_safe(expires, base->get_ktime());
+
+	trace_alarmtimer_start(alarm, base->get_ktime());
+
+	guard(spinlock_irqsave)(&base->lock);
+	alarm->node.expires = expires;
+	alarmtimer_enqueue(base, alarm);
+	if (!hrtimer_start_range_ns_user(&alarm->timer, expires, 0, HRTIMER_MODE_ABS)) {
+		alarmtimer_dequeue(base, alarm);
+		return false;
+	}
+	return true;
+}
+EXPORT_SYMBOL_GPL(alarm_start_timer);
+
 void alarm_restart(struct alarm *alarm)
 {
 	struct alarm_base *base = &alarm_bases[alarm->type];


^ permalink raw reply

* [patch V2 07/11] alarmtimer: Convert posix timer functions to alarm_start_timer()
From: Thomas Gleixner @ 2026-04-08 11:54 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Stephen Boyd, Calvin Owens, Anna-Maria Behnsen,
	Frederic Weisbecker, Peter Zijlstra (Intel), Alexander Viro,
	Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
	linux-pm, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	netfilter-devel, coreteam
In-Reply-To: <20260408102356.783133335@kernel.org>

Use the new alarm_start_timer() for arming and rearming posix interval
timers and for clock_nanosleep() so that already expired timers do not go
through the full timer interrupt cycle.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Cc: Stephen Boyd <sboyd@kernel.org>
---
V2: Rename to alarm_start_timer()
---
 kernel/time/alarmtimer.c |   20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -560,8 +560,7 @@ static bool alarm_timer_rearm(struct k_i
 	struct alarm *alarm = &timr->it.alarm.alarmtimer;
 
 	timr->it_overrun += alarm_forward_now(alarm, timr->it_interval);
-	alarm_start(alarm, alarm->node.expires);
-	return true;
+	return alarm_start_timer(alarm, alarm->node.expires, false);
 }
 
 /**
@@ -625,11 +624,16 @@ static bool alarm_timer_arm(struct k_iti
 
 	if (!absolute)
 		expires = ktime_add_safe(expires, base->get_ktime());
-	if (sigev_none)
+
+	/*
+	 * sigev_none needs to update the expires value and pretend
+	 * that the timer is queued
+	 */
+	if (sigev_none) {
 		alarm->node.expires = expires;
-	else
-		alarm_start(&timr->it.alarm.alarmtimer, expires);
-	return true;
+		return true;
+	}
+	return alarm_start_timer(&timr->it.alarm.alarmtimer, expires, false);
 }
 
 /**
@@ -736,7 +740,9 @@ static int alarmtimer_do_nsleep(struct a
 	alarm->data = (void *)current;
 	do {
 		set_current_state(TASK_INTERRUPTIBLE);
-		alarm_start(alarm, absexp);
+		if (!alarm_start_timer(alarm, absexp, false))
+			alarm->data = NULL;
+
 		if (likely(alarm->data))
 			schedule();
 


^ permalink raw reply

* [patch V2 08/11] fs/timerfd: Use the new alarm/hrtimer functions
From: Thomas Gleixner @ 2026-04-08 11:54 UTC (permalink / raw)
  To: LKML
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Anna-Maria Behnsen,
	Frederic Weisbecker, linux-fsdevel, Calvin Owens,
	Peter Zijlstra (Intel), John Stultz, Stephen Boyd,
	Sebastian Reichel, linux-pm, Pablo Neira Ayuso, Florian Westphal,
	Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <20260408102356.783133335@kernel.org>

Like any other user controlled interface, timerfd based timers can be
programmed with expiry times in the past or very small intervals.

Both hrtimer and alarmtimer provide new interfaces which return the queued
state of the timer. If the timer was already expired, then let the callsite
handle the timerfd context update so that the full round trip through the
hrtimer interrupt is avoided.
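
The pattern described above can be sketched in plain userspace C. This is
only a model, not the kernel code: struct model_timer, model_start_timer()
and model_arm() are hypothetical names, and the "already expired" test
(expiry not later than now) is an assumption about what the new start
functions check internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a timerfd context; not the kernel structure. */
struct model_timer {
	int64_t expires;	/* absolute expiry time in ns */
	bool queued;		/* true while the timer is enqueued */
	uint64_t ticks;		/* expirations not yet consumed by a reader */
};

/*
 * Model of a start function that reports the queued state: an expiry
 * that is not in the future is not enqueued and false is returned.
 */
static bool model_start_timer(struct model_timer *t, int64_t expires, int64_t now)
{
	assert(t != NULL);
	t->expires = expires;
	t->queued = expires > now;
	return t->queued;
}

/* Callsite: account the expiry directly instead of waiting for an interrupt. */
static void model_arm(struct model_timer *t, int64_t expires, int64_t now)
{
	if (!model_start_timer(t, expires, now))
		t->ticks++;
}
```

In this model, arming with an expiry of 500 at time 1000 leaves the timer
unqueued and bumps ticks right away, while an expiry of 2000 queues it.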

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
---
V2: Rename to alarm_start_timer() and add a comment explaining the -1 in
    the tick accounting. - Peter
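
The -1 in the tick accounting can be illustrated with a small userspace
model. This is a sketch under assumptions, not the kernel implementation:
model_forward_now() and model_restart_ticks() are hypothetical names, and
the model assumes that a "forward now" helper advances the expiry by whole
intervals until it lies after the current time and returns the number of
intervals added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model of a "forward now" helper: advance *expires by whole
 * intervals until it lies after now and return how many were added.
 */
static uint64_t model_forward_now(int64_t *expires, int64_t now, int64_t interval)
{
	uint64_t overruns;

	assert(expires != NULL && interval > 0);
	if (now < *expires)
		return 0;
	overruns = (uint64_t)((now - *expires) / interval) + 1;
	*expires += (int64_t)overruns * interval;
	return overruns;
}

/*
 * Extra ticks to add when one tick was already counted at expiry time.
 * Only meaningful for a timer that has expired (now >= *expires).
 */
static uint64_t model_restart_ticks(int64_t *expires, int64_t now, int64_t interval)
{
	assert(expires != NULL && now >= *expires);
	return model_forward_now(expires, now, interval) - 1;
}
```

With an expiry of 100, an interval of 100 and a current time of 350, three
expirations (100, 200, 300) have passed. One of them was counted when the
timer fired, so the restart path adds two more and moves the expiry to 400.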
---
 fs/timerfd.c |  117 ++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 68 insertions(+), 49 deletions(-)
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -55,6 +55,15 @@ static inline bool isalarm(struct timerf
 		ctx->clockid == CLOCK_BOOTTIME_ALARM;
 }
 
+static void __timerfd_triggered(struct timerfd_ctx *ctx)
+{
+	lockdep_assert_held(&ctx->wqh.lock);
+
+	ctx->expired = 1;
+	ctx->ticks++;
+	wake_up_locked_poll(&ctx->wqh, EPOLLIN);
+}
+
 /*
  * This gets called when the timer event triggers. We set the "expired"
  * flag, but we do not re-arm the timer (in case it's necessary,
@@ -62,13 +71,8 @@ static inline bool isalarm(struct timerf
  */
 static void timerfd_triggered(struct timerfd_ctx *ctx)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&ctx->wqh.lock, flags);
-	ctx->expired = 1;
-	ctx->ticks++;
-	wake_up_locked_poll(&ctx->wqh, EPOLLIN);
-	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
+	guard(spinlock_irqsave)(&ctx->wqh.lock);
+	__timerfd_triggered(ctx);
 }
 
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
@@ -184,15 +188,54 @@ static ktime_t timerfd_get_remaining(str
 	return remaining < 0 ? 0: remaining;
 }
 
+static void timerfd_alarm_start(struct timerfd_ctx *ctx, ktime_t exp, bool relative)
+{
+	/* Start the timer. If it's expired already, handle the callback. */
+	if (!alarm_start_timer(&ctx->t.alarm, exp, relative))
+		__timerfd_triggered(ctx);
+}
+
+static u64 timerfd_alarm_restart(struct timerfd_ctx *ctx)
+{
+	/* -1 to account for ctx->ticks++ in __timerfd_triggered() */
+	u64 ticks = alarm_forward_now(&ctx->t.alarm, ctx->tintv) - 1;
+
+	timerfd_alarm_start(ctx, alarm_get_expires(&ctx->t.alarm), false);
+	return ticks;
+}
+
+static void timerfd_hrtimer_start(struct timerfd_ctx *ctx, ktime_t exp,
+				  const enum hrtimer_mode mode)
+{
+	/* Start the timer. If it's expired already, handle the callback. */
+	if (!hrtimer_start_range_ns_user(&ctx->t.tmr, exp, 0, mode))
+		__timerfd_triggered(ctx);
+}
+
+static u64 timerfd_hrtimer_restart(struct timerfd_ctx *ctx)
+{
+	/* -1 to account for ctx->ticks++ in __timerfd_triggered() */
+	u64 ticks = hrtimer_forward_now(&ctx->t.tmr, ctx->tintv) - 1;
+
+	timerfd_hrtimer_start(ctx, hrtimer_get_expires(&ctx->t.tmr), HRTIMER_MODE_ABS);
+	return ticks;
+}
+
+static u64 timerfd_restart(struct timerfd_ctx *ctx)
+{
+	if (isalarm(ctx))
+		return timerfd_alarm_restart(ctx);
+	return timerfd_hrtimer_restart(ctx);
+}
+
 static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
 			 const struct itimerspec64 *ktmr)
 {
+	int clockid = ctx->clockid;
 	enum hrtimer_mode htmode;
 	ktime_t texp;
-	int clockid = ctx->clockid;
 
-	htmode = (flags & TFD_TIMER_ABSTIME) ?
-		HRTIMER_MODE_ABS: HRTIMER_MODE_REL;
+	htmode = (flags & TFD_TIMER_ABSTIME) ? HRTIMER_MODE_ABS: HRTIMER_MODE_REL;
 
 	texp = timespec64_to_ktime(ktmr->it_value);
 	ctx->expired = 0;
@@ -206,20 +249,15 @@ static int timerfd_setup(struct timerfd_
 			   timerfd_alarmproc);
 	} else {
 		hrtimer_setup(&ctx->t.tmr, timerfd_tmrproc, clockid, htmode);
-		hrtimer_set_expires(&ctx->t.tmr, texp);
 	}
 
 	if (texp != 0) {
 		if (flags & TFD_TIMER_ABSTIME)
 			texp = timens_ktime_to_host(clockid, texp);
-		if (isalarm(ctx)) {
-			if (flags & TFD_TIMER_ABSTIME)
-				alarm_start(&ctx->t.alarm, texp);
-			else
-				alarm_start_relative(&ctx->t.alarm, texp);
-		} else {
-			hrtimer_start(&ctx->t.tmr, texp, htmode);
-		}
+		if (isalarm(ctx))
+			timerfd_alarm_start(ctx, texp, !(flags & TFD_TIMER_ABSTIME));
+		else
+			timerfd_hrtimer_start(ctx, texp, htmode);
 
 		if (timerfd_canceled(ctx))
 			return -ECANCELED;
@@ -287,27 +325,19 @@ static ssize_t timerfd_read_iter(struct
 	}
 
 	if (ctx->ticks) {
-		ticks = ctx->ticks;
+		unsigned int expired = ctx->expired;
 
-		if (ctx->expired && ctx->tintv) {
-			/*
-			 * If tintv != 0, this is a periodic timer that
-			 * needs to be re-armed. We avoid doing it in the timer
-			 * callback to avoid DoS attacks specifying a very
-			 * short timer period.
-			 */
-			if (isalarm(ctx)) {
-				ticks += alarm_forward_now(
-					&ctx->t.alarm, ctx->tintv) - 1;
-				alarm_restart(&ctx->t.alarm);
-			} else {
-				ticks += hrtimer_forward_now(&ctx->t.tmr,
-							     ctx->tintv) - 1;
-				hrtimer_restart(&ctx->t.tmr);
-			}
-		}
+		ticks = ctx->ticks;
 		ctx->expired = 0;
 		ctx->ticks = 0;
+
+		/*
+		 * If tintv != 0, this is a periodic timer that needs to be
+		 * re-armed. We avoid doing it in the timer callback to avoid
+		 * DoS attacks specifying a very short timer period.
+		 */
+		if (expired && ctx->tintv)
+			ticks += timerfd_restart(ctx);
 	}
 	spin_unlock_irq(&ctx->wqh.lock);
 	if (ticks) {
@@ -526,18 +556,7 @@ static int do_timerfd_gettime(int ufd, s
 	spin_lock_irq(&ctx->wqh.lock);
 	if (ctx->expired && ctx->tintv) {
 		ctx->expired = 0;
-
-		if (isalarm(ctx)) {
-			ctx->ticks +=
-				alarm_forward_now(
-					&ctx->t.alarm, ctx->tintv) - 1;
-			alarm_restart(&ctx->t.alarm);
-		} else {
-			ctx->ticks +=
-				hrtimer_forward_now(&ctx->t.tmr, ctx->tintv)
-				- 1;
-			hrtimer_restart(&ctx->t.tmr);
-		}
+		ctx->ticks += timerfd_restart(ctx);
 	}
 	t->it_value = ktime_to_timespec64(timerfd_get_remaining(ctx));
 	t->it_interval = ktime_to_timespec64(ctx->tintv);


^ permalink raw reply

* [patch V2 09/11] power: supply: charger-manager: Switch to alarm_start_timer()
From: Thomas Gleixner @ 2026-04-08 11:54 UTC (permalink / raw)
  To: LKML
  Cc: Sebastian Reichel, linux-pm, Calvin Owens, Anna-Maria Behnsen,
	Frederic Weisbecker, Peter Zijlstra (Intel), John Stultz,
	Stephen Boyd, Alexander Viro, Christian Brauner, Jan Kara,
	linux-fsdevel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	netfilter-devel, coreteam
In-Reply-To: <20260408102356.783133335@kernel.org>

The existing alarm_start() interface is replaced with the new
alarm_start_timer() mechanism, which no longer queues an already
expired timer and returns the queued state. Adjust the code to utilize this.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: linux-pm@vger.kernel.org
---
V2: Rename to alarm_start_timer()
---
 drivers/power/supply/charger-manager.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
--- a/drivers/power/supply/charger-manager.c
+++ b/drivers/power/supply/charger-manager.c
@@ -881,7 +881,7 @@ static bool cm_setup_timer(void)
 	mutex_unlock(&cm_list_mtx);
 
 	if (timer_req && cm_timer) {
-		ktime_t now, add;
+		ktime_t exp;
 
 		/*
 		 * Set alarm with the polling interval (wakeup_ms)
@@ -893,14 +893,16 @@ static bool cm_setup_timer(void)
 
 		pr_info("Charger Manager wakeup timer: %u ms\n", wakeup_ms);
 
-		now = ktime_get_boottime();
-		add = ktime_set(wakeup_ms / MSEC_PER_SEC,
+		exp = ktime_set(wakeup_ms / MSEC_PER_SEC,
 				(wakeup_ms % MSEC_PER_SEC) * NSEC_PER_MSEC);
-		alarm_start(cm_timer, ktime_add(now, add));
 
 		cm_suspend_duration_ms = wakeup_ms;
 
-		return true;
+		/*
+		 * The timer should always be queued as the timeout is at least
+		 * two seconds out. Handle it correctly nevertheless.
+		 */
+		return alarm_start_timer(cm_timer, exp, true);
 	}
 	return false;
 }


^ permalink raw reply

* [patch V2 10/11] netfilter: xt_IDLETIMER: Switch to alarm_start_timer()
From: Thomas Gleixner @ 2026-04-08 11:54 UTC (permalink / raw)
  To: LKML
  Cc: Pablo Neira Ayuso, Florian Westphal, Phil Sutter, netfilter-devel,
	coreteam, Calvin Owens, Anna-Maria Behnsen, Frederic Weisbecker,
	Peter Zijlstra (Intel), John Stultz, Stephen Boyd, Alexander Viro,
	Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
	linux-pm
In-Reply-To: <20260408102356.783133335@kernel.org>

The existing alarm_start() interface is replaced with the new
alarm_start_timer() mechanism, which no longer queues an already
expired timer and returns the queued state.

Adjust the code to utilize this so it schedules the work in the case that
the timer was already expired. Unlikely to happen as the timeout is at
least a second, but not impossible especially with virtualization.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: Phil Sutter <phil@nwl.cc>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org

---
 net/netfilter/xt_IDLETIMER.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)
--- a/net/netfilter/xt_IDLETIMER.c
+++ b/net/netfilter/xt_IDLETIMER.c
@@ -115,6 +115,21 @@ static void idletimer_tg_alarmproc(struc
 	schedule_work(&timer->work);
 }
 
+static void idletimer_start_alarm_ktime(struct idletimer_tg *timer, ktime_t timeout)
+{
+	/*
+	 * The timer should always be queued as @timeout should be at least one
+	 * second, but handle it correctly in any case. Virt will manage!
+	 */
+	if (!alarm_start_timer(&timer->alarm, timeout, true))
+		schedule_work(&timer->work);
+}
+
+static void idletimer_start_alarm_sec(struct idletimer_tg *timer, unsigned int seconds)
+{
+	idletimer_start_alarm_ktime(timer, ktime_set(seconds, 0));
+}
+
 static int idletimer_check_sysfs_name(const char *name, unsigned int size)
 {
 	int ret;
@@ -220,12 +235,10 @@ static int idletimer_tg_create_v1(struct
 	INIT_WORK(&info->timer->work, idletimer_tg_work);
 
 	if (info->timer->timer_type & XT_IDLETIMER_ALARM) {
-		ktime_t tout;
 		alarm_init(&info->timer->alarm, ALARM_BOOTTIME,
 			   idletimer_tg_alarmproc);
 		info->timer->alarm.data = info->timer;
-		tout = ktime_set(info->timeout, 0);
-		alarm_start_relative(&info->timer->alarm, tout);
+		idletimer_start_alarm_sec(info->timer, info->timeout);
 	} else {
 		timer_setup(&info->timer->timer, idletimer_tg_expired, 0);
 		mod_timer(&info->timer->timer,
@@ -271,8 +284,7 @@ static unsigned int idletimer_tg_target_
 		 info->label, info->timeout);
 
 	if (info->timer->timer_type & XT_IDLETIMER_ALARM) {
-		ktime_t tout = ktime_set(info->timeout, 0);
-		alarm_start_relative(&info->timer->alarm, tout);
+		idletimer_start_alarm_sec(info->timer, info->timeout);
 	} else {
 		mod_timer(&info->timer->timer,
 				secs_to_jiffies(info->timeout) + jiffies);
@@ -378,7 +390,7 @@ static int idletimer_tg_checkentry_v1(co
 			if (ktimespec.tv_sec > 0) {
 				pr_debug("time_expiry_remaining %lld\n",
 					 ktimespec.tv_sec);
-				alarm_start_relative(&info->timer->alarm, tout);
+				idletimer_start_alarm_ktime(info->timer, tout);
 			}
 		} else {
 				mod_timer(&info->timer->timer,


^ permalink raw reply

* [patch V2 11/11] alarmtimer: Remove unused interfaces
From: Thomas Gleixner @ 2026-04-08 11:54 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Stephen Boyd, Calvin Owens, Anna-Maria Behnsen,
	Frederic Weisbecker, Peter Zijlstra (Intel), Alexander Viro,
	Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
	linux-pm, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	netfilter-devel, coreteam
In-Reply-To: <20260408102356.783133335@kernel.org>

All alarmtimer users are converted to alarm_start_timer(). Remove the now
unused interfaces.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: John Stultz <jstultz@google.com>
Cc: Stephen Boyd <sboyd@kernel.org>

---
 include/linux/alarmtimer.h |    3 ---
 kernel/time/alarmtimer.c   |   44 --------------------------------------------
 2 files changed, 47 deletions(-)
--- a/include/linux/alarmtimer.h
+++ b/include/linux/alarmtimer.h
@@ -50,9 +50,6 @@ static __always_inline ktime_t alarm_get
 void alarm_init(struct alarm *alarm, enum alarmtimer_type type,
 		void (*function)(struct alarm *, ktime_t));
 bool alarm_start_timer(struct alarm *alarm, ktime_t expires, bool relative);
-void alarm_start(struct alarm *alarm, ktime_t start);
-void alarm_start_relative(struct alarm *alarm, ktime_t start);
-void alarm_restart(struct alarm *alarm);
 int alarm_try_to_cancel(struct alarm *alarm);
 int alarm_cancel(struct alarm *alarm);
 
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -333,39 +333,6 @@ void alarm_init(struct alarm *alarm, enu
 EXPORT_SYMBOL_GPL(alarm_init);
 
 /**
- * alarm_start - Sets an absolute alarm to fire
- * @alarm: ptr to alarm to set
- * @start: time to run the alarm
- */
-void alarm_start(struct alarm *alarm, ktime_t start)
-{
-	struct alarm_base *base = &alarm_bases[alarm->type];
-
-	scoped_guard(spinlock_irqsave, &base->lock) {
-		alarm->node.expires = start;
-		alarmtimer_enqueue(base, alarm);
-		hrtimer_start(&alarm->timer, alarm->node.expires, HRTIMER_MODE_ABS);
-	}
-
-	trace_alarmtimer_start(alarm, base->get_ktime());
-}
-EXPORT_SYMBOL_GPL(alarm_start);
-
-/**
- * alarm_start_relative - Sets a relative alarm to fire
- * @alarm: ptr to alarm to set
- * @start: time relative to now to run the alarm
- */
-void alarm_start_relative(struct alarm *alarm, ktime_t start)
-{
-	struct alarm_base *base = &alarm_bases[alarm->type];
-
-	start = ktime_add_safe(start, base->get_ktime());
-	alarm_start(alarm, start);
-}
-EXPORT_SYMBOL_GPL(alarm_start_relative);
-
-/**
  * alarm_start_timer - Sets an alarm to fire
  * @alarm:	Pointer to alarm to set
  * @expires:	Expiry time
@@ -393,17 +360,6 @@ bool alarm_start_timer(struct alarm *ala
 }
 EXPORT_SYMBOL_GPL(alarm_start_timer);
 
-void alarm_restart(struct alarm *alarm)
-{
-	struct alarm_base *base = &alarm_bases[alarm->type];
-
-	guard(spinlock_irqsave)(&base->lock);
-	hrtimer_set_expires(&alarm->timer, alarm->node.expires);
-	hrtimer_restart(&alarm->timer);
-	alarmtimer_enqueue(base, alarm);
-}
-EXPORT_SYMBOL_GPL(alarm_restart);
-
 /**
  * alarm_try_to_cancel - Tries to cancel an alarm timer
  * @alarm: ptr to alarm to be canceled


^ permalink raw reply

* [PATCH v11 01/14] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora
In-Reply-To: <20260408122538.3610871-1-ankur.a.arora@oracle.com>

Add smp_cond_load_relaxed_timeout(), which extends
smp_cond_load_relaxed() to allow waiting for a duration.

We loop around waiting for the condition variable to change while
periodically doing a time-check. The loop uses cpu_poll_relax() to slow
down the busy-wait, which, unless overridden by the architecture
code, amounts to a cpu_relax().

Note that there are two ways for the time-check to fail: the timeout
case, or @time_expr_ns returning an invalid value (negative or zero).
The second failure mode allows for clocks attached to the clock-domain
of @cond_expr --  which might cease to operate meaningfully once some
state internal to @cond_expr has changed -- to fail.

Evaluation of @time_expr_ns: in the fastpath we want to keep the
performance close to smp_cond_load_relaxed(). So defer evaluation
of the potentially costly @time_expr_ns to the slowpath.

This also means that there will always be some hardware dependent
duration that has passed in cpu_poll_relax() iterations at the time
of first evaluation. Additionally cpu_poll_relax() is not guaranteed
to return at timeout boundary. In sum, expect timeout overshoot when
we exit due to expiration of the timeout.

The number of spin iterations before time-check, SMP_TIMEOUT_POLL_COUNT,
is chosen to be 200 by default. With a cpu_poll_relax() iteration
taking ~20-30 cycles (measured on a variety of x86 platforms), we
expect a time-check every ~4000-6000 cycles.

The outer limit of the overshoot is double that when working with the
parameters above. This might be higher or lower depending on the
implementation of cpu_poll_relax() across architectures.
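
The arithmetic above can be written out as a short userspace sketch. The
helper names are hypothetical, and the 2 GHz clock used in the note below
is an assumed figure for illustration, not a measurement from this series.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles spent spinning between two consecutive time checks. */
static uint64_t cycles_between_time_checks(uint32_t poll_count, uint32_t cycles_per_iter)
{
	return (uint64_t)poll_count * cycles_per_iter;
}

/* Convert cycles to ns for a CPU clock given in MHz (hypothetical helper). */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t freq_mhz)
{
	assert(freq_mhz != 0);
	return cycles * 1000 / freq_mhz;
}
```

With 200 iterations at 20 to 30 cycles each this gives 4000 to 6000 cycles
per time check, which on an assumed 2 GHz CPU is 2000 to 3000 ns.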

Lastly, config option ARCH_HAS_CPU_RELAX indicates availability of a
cpu_poll_relax() that is cheaper than polling. This might be relevant
for cases with a long timeout.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
Notes:
   - add a comment mentioning that smp_cond_load_relaxed_timeout() might
     be using architectural primitives that don't support MMIO.
     (David Laight, Catalin Marinas)

 include/asm-generic/barrier.h | 69 +++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..e5a6a1c04649 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,75 @@ do {									\
 })
 #endif

+/*
+ * Number of times we iterate in the loop before doing the time check.
+ * Note that the iteration count assumes that the loop condition is
+ * relatively cheap.
+ */
+#ifndef SMP_TIMEOUT_POLL_COUNT
+#define SMP_TIMEOUT_POLL_COUNT		200
+#endif
+
+/*
+ * Platforms with ARCH_HAS_CPU_RELAX have a cpu_poll_relax() implementation
+ * that is expected to be cheaper (lower power) than pure polling.
+ */
+#ifndef cpu_poll_relax
+#define cpu_poll_relax(ptr, val, timeout_ns)	cpu_relax()
+#endif
+
+/**
+ * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on.
+ * @cond_expr: boolean expression to wait for.
+ * @time_expr_ns: expression that evaluates to monotonic time (in ns) or,
+ *  on failure, returns a negative value.
+ * @timeout_ns: timeout value in ns
+ * Both of the above are assumed to be compatible with s64; the signed
+ * value is used to handle the failure case in @time_expr_ns.
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ *
+ * Callers that expect to wait for prolonged durations might want
+ * to take into account the availability of ARCH_HAS_CPU_RELAX.
+ *
+ * Note that @ptr is expected to point to a memory address. Using this
+ * interface with MMIO will be slower (since SMP_TIMEOUT_POLL_COUNT is
+ * tuned for memory) and might also break in interesting architecture
+ * dependent ways.
+ */
+#ifndef smp_cond_load_relaxed_timeout
+#define smp_cond_load_relaxed_timeout(ptr, cond_expr,			\
+				      time_expr_ns, timeout_ns)		\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT;			\
+	s64 __timeout = (s64)timeout_ns;				\
+	s64 __time_now, __time_end = 0;					\
+									\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		cpu_poll_relax(__PTR, VAL, (u64)__timeout);		\
+		if (++__n < __spin)					\
+			continue;					\
+		__time_now = (s64)(time_expr_ns);			\
+		if (unlikely(__time_end == 0))				\
+			__time_end = __time_now + __timeout;		\
+		__timeout = __time_end - __time_now;			\
+		if (__time_now <= 0 || __timeout <= 0) {		\
+			VAL = READ_ONCE(*__PTR);			\
+			break;						\
+		}							\
+		__n = 0;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+#endif
+
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have
-- 
2.31.1

^ permalink raw reply related

* [PATCH v11 00/14] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
From: Ankur Arora @ 2026-04-08 12:25 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, ashok.bhat, Ankur Arora

Hi,

Main change in this version:
  - adds a kunit validation test.

What remains:
  - Review by PeterZ of the new interface tif_need_resched_relaxed_wait()
    (patch 11, "sched: add need-resched timed wait interface").
    (Peter had originally proposed using smp_cond_load_relaxed() in
     poll_idle() [11]).

The core kernel often uses smp_cond_load_{relaxed,acquire}() to spin
on condition variables with architectural primitives used to avoid
hammering the relevant cachelines.

(This primitive can vary greatly across architectures: on x86 it's a
cpu_relax() to slow down the pipeline. On arm64, this is a __cmpwait()
which waits for a cacheline to change state in a time limited fashion.)

Regardless of architectural details, typical smp_cond_load*() usage
does not allow for termination until the condition change occurs.

Beyond the core kernel, there are cases where it is useful to additionally
terminate on a timeout. Two cases:

  - cpuidle poll_idle(): wait for need-resched until the cpuidle polling
    duration expires.

  - rqspinlock: nested qspinlock acquisition that terminates on timeout
    or deadlock.

Accordingly add two interfaces (with their generic and arm64 specific
implementations):

   smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr, timeout)
   smp_cond_load_acquire_timeout(ptr, cond_expr, time_expr, timeout)

Also add tif_need_resched_relaxed_wait() which wraps the polling
pattern and its scheduler specific details in poll_idle().
In addition add atomic_cond_read_*_timeout(),
atomic64_cond_read_*_timeout(), and atomic_long wrappers.

Structurally, both the smp_cond_load_*_timeout() interfaces are similar
to smp_cond_load*(), with the addition of a rate-limited time-check.
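
The rate-limited time-check can be modelled in userspace with C11 atomics.
This is a sketch, not the kernel macro: all model_* names are hypothetical,
the condition is fixed to "value is nonzero", the cpu_poll_relax() step is
left out, and a fake clock that advances 1000 ns per call stands in for the
time expression.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_POLL_COUNT 200u	/* same default as SMP_TIMEOUT_POLL_COUNT */

typedef int64_t (*model_clock_fn)(void);

/* Fake clock for demonstration: advances 1000 ns on every call. */
static int64_t model_fake_now;
static unsigned int model_fake_calls;

static int64_t model_fake_clock_ns(void)
{
	model_fake_calls++;
	model_fake_now += 1000;
	return model_fake_now;
}

/* Spin until *ptr is nonzero or the timeout expires; return the last value read. */
static int model_wait_nonzero_timeout(atomic_int *ptr, model_clock_fn now_ns,
				      int64_t timeout_ns)
{
	uint32_t n = 0;
	int64_t now, end = 0;
	int val;

	assert(ptr != NULL && now_ns != NULL);
	for (;;) {
		val = atomic_load_explicit(ptr, memory_order_relaxed);
		if (val != 0)
			break;
		/* The kernel macro calls cpu_poll_relax() at this point. */
		if (++n < MODEL_POLL_COUNT)
			continue;
		/* Slowpath: the clock is read only once per MODEL_POLL_COUNT spins. */
		now = now_ns();
		if (end == 0)
			end = now + timeout_ns;
		if (now <= 0 || end - now <= 0) {
			/* Re-read the variable before giving up. */
			val = atomic_load_explicit(ptr, memory_order_relaxed);
			break;
		}
		n = 0;
	}
	return val;
}
```

Because the deadline is computed at the first clock read rather than on
entry, the spins before that read are not counted against the timeout. That
is one source of the overshoot discussed for the generic implementation.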

Usage
==

These interfaces drop straightforwardly into the rqspinlock logic
since qspinlock already uses smp_cond_load*(), and the time-check
extension can now be used for timeout and deadlock handling.

Using tif_need_resched_relaxed_wait() in poll_idle() removes any
architectural details, allowing arm64 to support that path
straightforwardly.
(However, for efficiency reasons cpuidle/poll_state.c continues to
depend on ARCH_HAS_CPU_RELAX since that is defined on architectures
with an optimized architectural primitive.)


Performance
==

Apart from simplifications due to this change, supporting polling in
cpuidle on arm64 helps improve wakeup latency (needs a few cpuidle/acpi
patches):


  # perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
  perf bench sched pipe -l 1000000 -c 4

  # No haltpoll (and, no TIF_POLLING_NRFLAG):

  Performance counter stats for 'CPU(s) 4,5' (5 runs):

         25,229.57 msec task-clock                       #    2.000 CPUs utilized               ( +-  7.75% )
    45,821,250,284      cycles                           #    1.816 GHz                         ( +- 10.07% )
    26,557,496,665      instructions                     #    0.58  insn per cycle              ( +-  0.21% )
                 0      sched:sched_wake_idle_without_ipi #    0.000 /sec

       12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )


  # Haltpoll:

  Performance counter stats for 'CPU(s) 4,5' (5 runs):

         15,131.58 msec task-clock                       #    2.000 CPUs utilized               ( +- 10.00% )
    34,158,188,839      cycles                           #    2.257 GHz                         ( +-  6.91% )
    20,824,950,916      instructions                     #    0.61  insn per cycle              ( +-  0.09% )
         1,983,822      sched:sched_wake_idle_without_ipi #  131.105 K/sec                       ( +-  0.78% )

        7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )

  We get improved latency because we don't switch in and out of a
  deeper sleep state or from the hypervisor. This also causes us to
  execute ~20% fewer instructions.


Haris Okanovic also saw improvement in real workloads due to the
cpuidle changes: "observed 4-6% improvements in memcached, cassandra,
mysql, and postgresql under certain loads. Other applications likely
benefit too." [12]


Changelog:
  v10 [10]:
   - add a comment mentioning that smp_cond_load_relaxed_timeout() might
     be using architectural primitives that don't support MMIO.
     (David Laight, Catalin Marinas)
   - added a kunit test for smp_cond_load_relaxed_timeout() (Andrew
     Morton.)

  v9 [9]:
   - s/@cond/@cond_expr/ (Randy Dunlap)
   - Clarify that SMP_TIMEOUT_POLL_COUNT is only around memory
     addresses. (David Laight)
   - Add the missing config ARCH_HAS_CPU_RELAX in arch/arm64/Kconfig.
     (Catalin Marinas).
   - Switch to arch_counter_get_cntvct_stable() (via __delay_cycles())
     in the cmpwait path instead of using arch_timer_read_counter().
     (Catalin Marinas)

  v8 [0]:
   - Defer evaluation of @time_expr_ns to when we hit the slowpath.
      (comment from Alexei Starovoitov).

   - Mention that cpu_poll_relax() is better than raw CPU polling
     only where ARCH_HAS_CPU_RELAX is defined.
     - also define ARCH_HAS_CPU_RELAX for arm64.
      (Came out of a discussion with Will Deacon.)

   - Split out WFET and WFE handling. I was doing both of these
     in a common handler.
     (From Will Deacon and in an earlier revision by Catalin Marinas.)

   - Add mentions of atomic_cond_read_{relaxed,acquire}(),
     atomic_cond_read_{relaxed,acquire}_timeout() in
     Documentation/atomic_t.txt.

   - Use the BIT() macro to do the checking in tif_bitset_relaxed_wait().

   - Cleanup unnecessary assignments, casts etc in poll_idle().
     (From Rafael Wysocki.)

   - Fixup warnings from kernel build robot


  v7 [1]:
   - change the interface to separately provide the timeout. This is
     useful for supporting WFET and similar primitives which can do
     timed waiting (suggested by Arnd Bergmann).

   - Adapting rqspinlock code to this changed interface also
     necessitated allowing time_expr to fail.
   - rqspinlock changes to adapt to the new smp_cond_load_acquire_timeout().

   - add WFET support (suggested by Arnd Bergmann).
   - add support for atomic-long wrappers.
   - add a new scheduler interface tif_need_resched_relaxed_wait() which
     encapsulates the polling logic used by poll_idle().
     - interface suggested by (Rafael J. Wysocki).


  v6 [2]:
   - fixup missing timeout parameters in atomic64_cond_read_*_timeout()
   - remove a race between setting of TIF_NEED_RESCHED and the call to
     smp_cond_load_relaxed_timeout(). This would mean that dev->poll_time_limit
     would be set even if we hadn't spent any time waiting.
     (The original check compared against local_clock(), which would have been
     fine, but I was instead using a cheaper check against _TIF_NEED_RESCHED.)
   (Both from meta-CI bot)


  v5 [3]:
   - use cpu_poll_relax() instead of cpu_relax().
   - instead of defining an arm64 specific
     smp_cond_load_relaxed_timeout(), just define the appropriate
     cpu_poll_relax().
   - re-read the target pointer when we exit due to the time-check.
   - s/SMP_TIMEOUT_SPIN_COUNT/SMP_TIMEOUT_POLL_COUNT/
   (Suggested by Will Deacon)

   - add atomic_cond_read_*_timeout() and atomic64_cond_read_*_timeout()
     interfaces.
   - rqspinlock: use atomic_cond_read_acquire_timeout().
   - cpuidle: use smp_cond_load_relaxed_timeout() for polling.
   (Suggested by Catalin Marinas)

   - rqspinlock: define SMP_TIMEOUT_POLL_COUNT to be 16k for non arm64


  v4 [4]:
    - naming change 's/timewait/timeout/'
    - resilient spinlocks: get rid of res_smp_cond_load_acquire_waiting()
      and fixup use of RES_CHECK_TIMEOUT().
    (Both suggested by Catalin Marinas)

  v3 [5]:
    - further interface simplifications (suggested by Catalin Marinas)

  v2 [6]:
    - simplified the interface (suggested by Catalin Marinas)
       - get rid of wait_policy, and a multitude of constants
       - adds a slack parameter
      This helped remove a fair amount of duplicated code and, in
      hindsight, unnecessary constants.

  v1 [7]:
     - add wait_policy (coarse and fine)
     - derive spin-count etc at runtime instead of using arbitrary
       constants.

Haris Okanovic tested v4 of this series with poll_idle()/haltpoll patches. [8]

Comments appreciated!

Thanks
Ankur

 [0] https://lore.kernel.org/lkml/20251215044919.460086-1-ankur.a.arora@oracle.com/
 [1] https://lore.kernel.org/lkml/20251028053136.692462-1-ankur.a.arora@oracle.com/
 [2] https://lore.kernel.org/lkml/20250911034655.3916002-1-ankur.a.arora@oracle.com/
 [3] https://lore.kernel.org/lkml/20250911034655.3916002-1-ankur.a.arora@oracle.com/
 [4] https://lore.kernel.org/lkml/20250829080735.3598416-1-ankur.a.arora@oracle.com/
 [5] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
 [6] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
 [7] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
 [8] https://lore.kernel.org/lkml/2cecbf7fb23ee83a4ce027e1be3f46f97efd585c.camel@amazon.com/
 [9] https://lore.kernel.org/lkml/20260209023153.2661784-1-ankur.a.arora@oracle.com/
 [10] https://lore.kernel.org/lkml/20260316013651.3225328-1-ankur.a.arora@oracle.com/
 [11] https://lore.kernel.org/lkml/20230809134837.GM212435@hirez.programming.kicks-ass.net/
 [12] https://lore.kernel.org/lkml/c6f3c8d3f1f2e89a9dc7ae22482973b5a51b08cb.camel@amazon.com/

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: bpf@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-pm@vger.kernel.org

Ankur Arora (14):
  asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
  arm64: barrier: Support smp_cond_load_relaxed_timeout()
  arm64/delay: move some constants out to a separate header
  arm64: support WFET in smp_cond_load_relaxed_timeout()
  arm64: rqspinlock: Remove private copy of
    smp_cond_load_acquire_timewait()
  asm-generic: barrier: Add smp_cond_load_acquire_timeout()
  atomic: Add atomic_cond_read_*_timeout()
  locking/atomic: scripts: build atomic_long_cond_read_*_timeout()
  bpf/rqspinlock: switch check_timeout() to a clock interface
  bpf/rqspinlock: Use smp_cond_load_acquire_timeout()
  sched: add need-resched timed wait interface
  cpuidle/poll_state: Wait for need-resched via
    tif_need_resched_relaxed_wait()
  kunit: enable testing smp_cond_load_relaxed_timeout()
  kunit: add tests for smp_cond_load_relaxed_timeout()

 Documentation/atomic_t.txt           |  14 +--
 arch/arm64/Kconfig                   |   3 +
 arch/arm64/include/asm/barrier.h     |  23 +++++
 arch/arm64/include/asm/cmpxchg.h     |  62 ++++++++++---
 arch/arm64/include/asm/delay-const.h |  27 ++++++
 arch/arm64/include/asm/rqspinlock.h  |  85 ------------------
 arch/arm64/lib/delay.c               |  17 ++--
 drivers/clocksource/arm_arch_timer.c |   2 +
 drivers/cpuidle/poll_state.c         |  21 +----
 drivers/soc/qcom/rpmh-rsc.c          |   8 +-
 include/asm-generic/barrier.h        |  95 ++++++++++++++++++++
 include/linux/atomic.h               |  10 +++
 include/linux/atomic/atomic-long.h   |  18 ++--
 include/linux/sched/idle.h           |  29 +++++++
 kernel/bpf/rqspinlock.c              |  77 +++++++++++------
 lib/Kconfig.debug                    |  10 +++
 lib/tests/Makefile                   |   1 +
 lib/tests/barrier-timeout-test.c     | 125 +++++++++++++++++++++++++++
 scripts/atomic/gen-atomic-long.sh    |  16 ++--
 19 files changed, 465 insertions(+), 178 deletions(-)
 create mode 100644 arch/arm64/include/asm/delay-const.h
 create mode 100644 lib/tests/barrier-timeout-test.c

-- 
2.31.1


^ permalink raw reply
