Linux Power Management development
* Re: [PATCH v4 07/13] mfd: sec: set DMA coherent mask
From: Krzysztof Kozlowski @ 2026-04-15  7:19 UTC
  To: Kaustabh Chakraborty
  Cc: Lee Jones, Pavel Machek, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, MyungJoo Ham, Chanwoo Choi, Sebastian Reichel,
	André Draszik, Alexandre Belloni, Jonathan Corbet,
	Shuah Khan, Nam Tran, Łukasz Lebiedziński, linux-leds,
	devicetree, linux-kernel, linux-pm, linux-samsung-soc, linux-rtc,
	linux-doc
In-Reply-To: <20260414-s2mu005-pmic-v4-7-7fe7480577e6@disroot.org>

On Tue, Apr 14, 2026 at 12:02:59PM +0530, Kaustabh Chakraborty wrote:
> Kernel logs are filled with "DMA mask not set" messages for every
> sub-device. The device does not use DMA for communication, so these
> messages are useless. Disable the coherent DMA mask for the PMIC device,
> which is also propagated to sub-devices.
> 
> Signed-off-by: Kaustabh Chakraborty <kauschluss@disroot.org>
> ---
>  Documentation/devicetree/bindings/mfd/samsung,s2mps11.yaml | 3 +++
>  drivers/mfd/sec-common.c                                   | 3 +++
>  2 files changed, 6 insertions(+)
>

Please run scripts/checkpatch.pl on the patches and fix the reported
warnings. After that, also run 'scripts/checkpatch.pl --strict' on the
patches and (probably) fix more warnings. Some warnings can be ignored,
especially from the --strict run, but the code here looks like it needs a
fix. Feel free to get in touch if a warning is not clear.

Best regards,
Krzysztof



* Re: [PATCH v4 04/13] dt-bindings: power: supply: document Samsung S2M series PMIC charger device
From: Krzysztof Kozlowski @ 2026-04-15  7:18 UTC
  To: Kaustabh Chakraborty
  Cc: Lee Jones, Pavel Machek, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, MyungJoo Ham, Chanwoo Choi, Sebastian Reichel,
	André Draszik, Alexandre Belloni, Jonathan Corbet,
	Shuah Khan, Nam Tran, Łukasz Lebiedziński, linux-leds,
	devicetree, linux-kernel, linux-pm, linux-samsung-soc, linux-rtc,
	linux-doc
In-Reply-To: <20260414-s2mu005-pmic-v4-4-7fe7480577e6@disroot.org>

On Tue, Apr 14, 2026 at 12:02:56PM +0530, Kaustabh Chakraborty wrote:
> +description: |
> +  The Samsung S2M series PMIC battery charger manages power interfacing
> +  of the USB port. It may supply power, as done in USB OTG operation
> +  mode, or it may accept power and redirect it to the battery fuelgauge
> +  for charging.
> +
> +  This is a part of device tree bindings for S2M and S5M family of Power
> +  Management IC (PMIC).
> +
> +  See also Documentation/devicetree/bindings/mfd/samsung,s2mps11.yaml for
> +  additional information and example.
> +
> +allOf:
> +  - $ref: power-supply.yaml#
> +
> +properties:
> +  compatible:
> +    enum:
> +      - samsung,s2mu005-charger
> +
> +  port:
> +    $ref: /schemas/graph.yaml#/properties/port

That port is an internal part of the device, so it should be dropped.
That leaves you with only one property, monitored-battery, so fold the
node into the parent node.

Best regards,
Krzysztof



* Re: [PATCH v4 05/13] dt-bindings: mfd: s2mps11: add documentation for S2MU005 PMIC
From: Krzysztof Kozlowski @ 2026-04-15  7:17 UTC
  To: Kaustabh Chakraborty
  Cc: Lee Jones, Pavel Machek, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, MyungJoo Ham, Chanwoo Choi, Sebastian Reichel,
	André Draszik, Alexandre Belloni, Jonathan Corbet,
	Shuah Khan, Nam Tran, Łukasz Lebiedziński, linux-leds,
	devicetree, linux-kernel, linux-pm, linux-samsung-soc, linux-rtc,
	linux-doc
In-Reply-To: <20260414-s2mu005-pmic-v4-5-7fe7480577e6@disroot.org>

On Tue, Apr 14, 2026 at 12:02:57PM +0530, Kaustabh Chakraborty wrote:
> Samsung's S2MU005 PMIC includes subdevices for a charger, an MUIC (Micro
> USB Interface Controller), and flash and RGB LED controllers.
> 
> Since regulators are not supported by this device, unmark this property
> as required and instead set this in a per-device basis for ones which
> need it.
> 
> Add the compatible and documentation for the S2MU005 PMIC. Also, add an
> example for nodes for supported sub-devices, i.e. charger, extcon,
> flash, and rgb.
> 

Limited review because this does not pass build checks.

> Signed-off-by: Kaustabh Chakraborty <kauschluss@disroot.org>
> ---
>  .../devicetree/bindings/mfd/samsung,s2mps11.yaml   | 121 ++++++++++++++++++++-
>  1 file changed, 120 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/mfd/samsung,s2mps11.yaml b/Documentation/devicetree/bindings/mfd/samsung,s2mps11.yaml
> index ac5d0c149796b..d3d305b9aa765 100644
> --- a/Documentation/devicetree/bindings/mfd/samsung,s2mps11.yaml
> +++ b/Documentation/devicetree/bindings/mfd/samsung,s2mps11.yaml
> @@ -26,12 +26,28 @@ properties:
>        - samsung,s2mps15-pmic
>        - samsung,s2mpu02-pmic
>        - samsung,s2mpu05-pmic
> +      - samsung,s2mu005-pmic
>  
>    clocks:
>      $ref: /schemas/clock/samsung,s2mps11.yaml
>      description:
>        Child node describing clock provider.
>  
> +  charger:
> +    $ref: /schemas/power/supply/samsung,s2mu005-charger.yaml
> +    description:
> +      Child node describing battery charger device.
> +
> +  extcon:

You got a comment to drop the extcon naming. If this node stays, name it
muic, for example.

> +    $ref: /schemas/extcon/samsung,s2mu005-muic.yaml
> +    description:
> +      Child node describing extcon device.
> +
> +  flash:
> +    $ref: /schemas/leds/samsung,s2mu005-flash.yaml
> +    description:
> +      Child node describing flash LEDs.
> +

Please make it a separate binding file.

>    interrupts:
>      maxItems: 1
>  
> @@ -43,6 +59,11 @@ properties:
>      description:
>        List of child nodes that specify the regulators.
>  
> +  rgb:

led

> +    $ref: /schemas/leds/samsung,s2mu005-rgb.yaml
> +    description:
> +      Child node describing RGB LEDs.
> +
>    samsung,s2mps11-acokb-ground:
>      description: |
>        Indicates that ACOKB pin of S2MPS11 PMIC is connected to the ground so
> @@ -63,7 +84,6 @@ properties:
>  required:
>    - compatible
>    - reg
> -  - regulators
>  
>  additionalProperties: false
>  
> @@ -78,6 +98,8 @@ allOf:
>          regulators:
>            $ref: /schemas/regulator/samsung,s2mps11.yaml
>          samsung,s2mps11-wrstbi-ground: false
> +      required:
> +        - regulators
>  
>    - if:
>        properties:
> @@ -89,6 +111,8 @@ allOf:
>          regulators:
>            $ref: /schemas/regulator/samsung,s2mps13.yaml
>          samsung,s2mps11-acokb-ground: false
> +      required:
> +        - regulators
>  
>    - if:
>        properties:
> @@ -101,6 +125,8 @@ allOf:
>            $ref: /schemas/regulator/samsung,s2mps14.yaml
>          samsung,s2mps11-acokb-ground: false
>          samsung,s2mps11-wrstbi-ground: false
> +      required:
> +        - regulators
>  
>    - if:
>        properties:
> @@ -113,6 +139,8 @@ allOf:
>            $ref: /schemas/regulator/samsung,s2mps15.yaml
>          samsung,s2mps11-acokb-ground: false
>          samsung,s2mps11-wrstbi-ground: false
> +      required:
> +        - regulators
>  
>    - if:
>        properties:
> @@ -125,6 +153,8 @@ allOf:
>            $ref: /schemas/regulator/samsung,s2mpu02.yaml
>          samsung,s2mps11-acokb-ground: false
>          samsung,s2mps11-wrstbi-ground: false
> +      required:
> +        - regulators
>  
>    - if:
>        properties:
> @@ -137,6 +167,18 @@ allOf:
>            $ref: /schemas/regulator/samsung,s2mpu05.yaml
>          samsung,s2mps11-acokb-ground: false
>          samsung,s2mps11-wrstbi-ground: false
> +      required:
> +        - regulators
> +
> +  - if:
> +      properties:
> +        compatible:
> +          contains:
> +            const: samsung,s2mu005-pmic
> +    then:
> +      properties:
> +        samsung,s2mps11-acokb-ground: false
> +        samsung,s2mps11-wrstbi-ground: false
>  
>  examples:
>    - |
> @@ -278,3 +320,80 @@ examples:
>              };
>          };
>      };
> +
> +  - |
> +    #include <dt-bindings/interrupt-controller/irq.h>
> +    #include <dt-bindings/leds/common.h>
> +
> +    i2c {
> +        #address-cells = <1>;
> +        #size-cells = <0>;
> +
> +        pmic@3d {
> +            compatible = "samsung,s2mu005-pmic";
> +            reg = <0x3d>;
> +            interrupt-parent = <&gpa2>;
> +            interrupts = <7 IRQ_TYPE_LEVEL_LOW>;
> +
> +            charger {
> +                compatible = "samsung,s2mu005-charger";
> +                monitored-battery = <&battery>;
> +
> +                port {
> +                    charger_to_muic: endpoint {
> +                        remote-endpoint = <&muic_to_charger>;

A graph between the device's own nodes is pointless.

Best regards,
Krzysztof



* Re: [PATCH RFC 05/11] riscv: cpufeature: Add Sdtrig optional CSRs checks
From: Zane Leung @ 2026-04-15  7:05 UTC
  To: Max Hsu, Conor Dooley, Rob Herring, Krzysztof Kozlowski,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Rafael J. Wysocki,
	Pavel Machek, Anup Patel, Atish Patra, Paolo Bonzini, Shuah Khan
  Cc: Palmer Dabbelt, linux-riscv, devicetree, linux-kernel, linux-pm,
	kvm, kvm-riscv, linux-kselftest
In-Reply-To: <20240329-dev-maxh-lin-452-6-9-v1-5-1534f93b94a7@sifive.com>


On 3/29/2024 5:26 PM, Max Hsu wrote:
> Sdtrig extension introduce two optional CSRs [hcontext/scontext],
> that will be storing PID/Guest OS ID for the debug feature.
>
> The availability of these two CSRs will be determined by
> DTS and Smstateen extension [h/s]stateen0 CSR bit 57.
>
> If all CPUs hcontext/scontext checks are satisfied, it will enable the
> use_hcontext/use_scontext static branch.
>
> Signed-off-by: Max Hsu <max.hsu@sifive.com>
> ---
>  arch/riscv/include/asm/switch_to.h |   6 ++
>  arch/riscv/kernel/cpufeature.c     | 161 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 167 insertions(+)
>
> diff --git a/arch/riscv/include/asm/switch_to.h b/arch/riscv/include/asm/switch_to.h
> index 7efdb0584d47..07432550ed54 100644
> --- a/arch/riscv/include/asm/switch_to.h
> +++ b/arch/riscv/include/asm/switch_to.h
> @@ -69,6 +69,12 @@ static __always_inline bool has_fpu(void) { return false; }
>  #define __switch_to_fpu(__prev, __next) do { } while (0)
>  #endif
>  
> +DECLARE_STATIC_KEY_FALSE(use_scontext);
> +static __always_inline bool has_scontext(void)
> +{
> +	return static_branch_likely(&use_scontext);
> +}
> +
>  extern struct task_struct *__switch_to(struct task_struct *,
>  				       struct task_struct *);
>  
> diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
> index 080c06b76f53..44ff84b920af 100644
> --- a/arch/riscv/kernel/cpufeature.c
> +++ b/arch/riscv/kernel/cpufeature.c
> @@ -35,6 +35,19 @@ static DECLARE_BITMAP(riscv_isa, RISCV_ISA_EXT_MAX) __read_mostly;
>  /* Per-cpu ISA extensions. */
>  struct riscv_isainfo hart_isa[NR_CPUS];
>  
> +atomic_t hcontext_disable;
> +atomic_t scontext_disable;
> +
> +DEFINE_STATIC_KEY_FALSE_RO(use_hcontext);
> +EXPORT_SYMBOL(use_hcontext);
> +
> +DEFINE_STATIC_KEY_FALSE_RO(use_scontext);
> +EXPORT_SYMBOL(use_scontext);
> +
> +/* Record the maximum number that the hcontext CSR allowed to hold */
> +atomic_long_t hcontext_id_share;
> +EXPORT_SYMBOL(hcontext_id_share);
> +
>  /**
>   * riscv_isa_extension_base() - Get base extension word
>   *
> @@ -719,6 +732,154 @@ unsigned long riscv_get_elf_hwcap(void)
>  	return hwcap;
>  }
>  
> +static void __init sdtrig_percpu_csrs_check(void *data)
> +{
> +	struct device_node *node;
> +	struct device_node *debug_node;
> +	struct device_node *trigger_module;
> +
> +	unsigned int cpu = smp_processor_id();
> +
> +	/*
> +	 * Expect every cpu node has the [h/s]context-present property
> +	 * otherwise, jump to sdtrig_csrs_disable_all to disable all access to
> +	 * [h/s]context CSRs
> +	 */
> +	node = of_cpu_device_node_get(cpu);
> +	if (!node)
> +		goto sdtrig_csrs_disable_all;
> +
> +	debug_node = of_get_compatible_child(node, "riscv,debug-v1.0.0");
> +	of_node_put(node);
> +
> +	if (!debug_node)
> +		goto sdtrig_csrs_disable_all;
> +
> +	trigger_module = of_get_child_by_name(debug_node, "trigger-module");
> +	of_node_put(debug_node);
> +
> +	if (!trigger_module)
> +		goto sdtrig_csrs_disable_all;
> +
> +	if (!(IS_ENABLED(CONFIG_KVM) &&
> +	      of_property_read_bool(trigger_module, "hcontext-present")))
> +		atomic_inc(&hcontext_disable);
> +
> +	if (!of_property_read_bool(trigger_module, "scontext-present"))
> +		atomic_inc(&scontext_disable);
> +
> +	of_node_put(trigger_module);
> +
> +	/*
> +	 * Before access to hcontext/scontext CSRs, if the smstateen
> +	 * extension is present, the accessibility will be controlled
> +	 * by the hstateen0[H]/sstateen0 CSRs.
> +	 */
> +	if (__riscv_isa_extension_available(NULL, RISCV_ISA_EXT_SMSTATEEN)) {
> +		u64 hstateen_bit, sstateen_bit;
> +
> +		if (__riscv_isa_extension_available(NULL, RISCV_ISA_EXT_h)) {
> +#if __riscv_xlen > 32
> +			csr_set(CSR_HSTATEEN0, SMSTATEEN0_HSCONTEXT);
> +			hstateen_bit = csr_read(CSR_HSTATEEN0);
> +#else
> +			csr_set(CSR_HSTATEEN0H, SMSTATEEN0_HSCONTEXT >> 32);
> +			hstateen_bit = csr_read(CSR_HSTATEEN0H) << 32;
> +#endif
> +			if (!(hstateen_bit & SMSTATEEN0_HSCONTEXT))
> +				goto sdtrig_csrs_disable_all;
> +
> +		} else {
> +			if (IS_ENABLED(CONFIG_KVM))
> +				atomic_inc(&hcontext_disable);
> +
> +			/*
> +			 * In RV32, the smstateen extension doesn't provide
> +			 * high 32 bits of sstateen0 CSR which represent
> +			 * accessibility for scontext CSR;
> +			 * The decision is left on whether the dts has the
> +			 * property to access the scontext CSR.
> +			 */
> +#if __riscv_xlen > 32
> +			csr_set(CSR_SSTATEEN0, SMSTATEEN0_HSCONTEXT);
> +			sstateen_bit = csr_read(CSR_SSTATEEN0);
> +
> +			if (!(sstateen_bit & SMSTATEEN0_HSCONTEXT))
> +				atomic_inc(&scontext_disable);
> +#endif
For the supervisor-level sstateen registers, high-half CSRs are not added
at this time, because the upper 32 bits of these registers are expected to
always be zero. See:
https://github.com/riscv/riscv-isa-manual/blob/dca12d638b140d86441ad0b067997c70d2017017/src/priv/smstateen.adoc#L71-L7
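
To illustrate the concern (a minimal sketch, not part of the patch): taking
the bit position from the commit message ("bit 57"), the context bit lies
entirely in the upper 32 bits. On RV32 it is therefore reachable only through
a high-half CSR, and a read-back test cannot succeed on a register whose
upper half always reads as zero. The macro and helper names below are
hypothetical, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: bit position taken from the commit message ("bit 57"). */
#define STATEEN0_CONTEXT_BIT	57
#define STATEEN0_CONTEXT_MASK	(UINT64_C(1) << STATEEN0_CONTEXT_BIT)

/* The bit sits above bit 31, so a 32-bit low-half CSR cannot hold it. */
static_assert(STATEEN0_CONTEXT_BIT >= 32, "context bit is in the upper half");

/* Part of a 64-bit stateen mask visible through a 32-bit low-half CSR. */
static uint32_t stateen_low_half(uint64_t mask)
{
	return (uint32_t)mask;
}

/* Part of the mask visible through a high-half CSR such as hstateen0h. */
static uint32_t stateen_high_half(uint64_t mask)
{
	return (uint32_t)(mask >> 32);
}
```

With these helpers, the low half of the mask is 0 and the high half has only
bit 25 set (57 - 32).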


> +		}
> +	}
> +
> +	/*
> +	 * The code can only access hcontext/scontext CSRs if:
> +	 * The cpu dts node have [h/s]context-present;
> +	 * If Smstateen extension is presented, then the accessibility bit
> +	 * toward hcontext/scontext CSRs is enabled; Or the Smstateen extension
> +	 * isn't available, thus the access won't be blocked by it.
> +	 *
> +	 * With writing 1 to the every bit of these CSRs, we retrieve the
> +	 * maximum bits that is available on the CSRs. and decide
> +	 * whether it's suit for its context recording operation.
> +	 */
> +	if (IS_ENABLED(CONFIG_KVM) &&
> +	    !atomic_read(&hcontext_disable)) {
> +		unsigned long hcontext_available_bits = 0;
> +
> +		csr_write(CSR_HCONTEXT, -1UL);
> +		hcontext_available_bits = csr_swap(CSR_HCONTEXT, hcontext_available_bits);
> +
> +		/* hcontext CSR is required by at least 1 bit */
> +		if (hcontext_available_bits)
> +			atomic_long_and(hcontext_available_bits, &hcontext_id_share);
> +		else
> +			atomic_inc(&hcontext_disable);
> +	}
> +
> +	if (!atomic_read(&scontext_disable)) {
> +		unsigned long scontext_available_bits = 0;
> +
> +		csr_write(CSR_SCONTEXT, -1UL);
> +		scontext_available_bits = csr_swap(CSR_SCONTEXT, scontext_available_bits);
> +
> +		/* scontext CSR is required by at least the sizeof pid_t */
> +		if (scontext_available_bits < ((1UL << (sizeof(pid_t) << 3)) - 1))
> +			atomic_inc(&scontext_disable);
> +	}
> +
> +	return;
> +
> +sdtrig_csrs_disable_all:
> +	if (IS_ENABLED(CONFIG_KVM))
> +		atomic_inc(&hcontext_disable);
> +
> +	atomic_inc(&scontext_disable);
> +}
> +
> +static int __init sdtrig_enable_csrs_fill(void)
> +{
> +	if (__riscv_isa_extension_available(NULL, RISCV_ISA_EXT_SDTRIG)) {
> +		atomic_long_set(&hcontext_id_share, -1UL);
> +
> +		/* check every CPUs sdtrig extension optional CSRs */
> +		sdtrig_percpu_csrs_check(NULL);
> +		smp_call_function(sdtrig_percpu_csrs_check, NULL, 1);
> +
> +		if (IS_ENABLED(CONFIG_KVM) &&
> +		    !atomic_read(&hcontext_disable)) {
> +			pr_info("riscv-sdtrig: Writing 'GuestOS ID' to hcontext CSR is enabled\n");
> +			static_branch_enable(&use_hcontext);
> +		}
> +
> +		if (!atomic_read(&scontext_disable)) {
> +			pr_info("riscv-sdtrig: Writing 'PID' to scontext CSR is enabled\n");
> +			static_branch_enable(&use_scontext);
> +		}
> +	}
> +	return 0;
> +}
> +
> +arch_initcall(sdtrig_enable_csrs_fill);
> +
>  void riscv_user_isa_enable(void)
>  {
>  	if (riscv_cpu_has_extension_unlikely(smp_processor_id(), RISCV_ISA_EXT_ZICBOZ))
>


* Re: [PATCH v4 02/13] dt-bindings: leds: document Samsung S2M series PMIC RGB LED device
From: Krzysztof Kozlowski @ 2026-04-15  7:03 UTC
  To: Kaustabh Chakraborty
  Cc: Lee Jones, Pavel Machek, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, MyungJoo Ham, Chanwoo Choi, Sebastian Reichel,
	André Draszik, Alexandre Belloni, Jonathan Corbet,
	Shuah Khan, Nam Tran, Łukasz Lebiedziński, linux-leds,
	devicetree, linux-kernel, linux-pm, linux-samsung-soc, linux-rtc,
	linux-doc
In-Reply-To: <20260414-s2mu005-pmic-v4-2-7fe7480577e6@disroot.org>

On Tue, Apr 14, 2026 at 12:02:54PM +0530, Kaustabh Chakraborty wrote:
> +description: |
> +  The Samsung S2M series PMIC RGB LED is a three-channel LED device with
> +  8-bit brightness control for each channel, typically used as status
> +  indicators in mobile phones.
> +
> +  This is a part of device tree bindings for S2M and S5M family of Power
> +  Management IC (PMIC).
> +
> +  See also Documentation/devicetree/bindings/mfd/samsung,s2mps11.yaml for
> +  additional information and example.
> +
> +allOf:
> +  - $ref: common.yaml#

Rob's comment is still valid:
1. How do you address one of three LEDs in non-RGB case?
2. Where is multi-color?

Based on this alone, without other properties, I say this should be
part of the top-level schema. A separate node is fine, but there is no
need for a separate binding.

Best regards,
Krzysztof



* [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: chensong_2000 @ 2026-04-15  7:01 UTC
  To: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers
  Cc: linux-modules, linux-kernel, linux-trace-kernel, linux-acpi,
	linux-clk, linux-pm, live-patching, dm-devel, linux-raid,
	kgdb-bugreport, netdev, Song Chen

From: Song Chen <chensong_2000@189.cn>

The current notifier chain implementation uses a singly-linked list
(struct notifier_block *next), which only supports forward traversal
in priority order. This makes it difficult to handle cleanup/teardown
scenarios that require notifiers to be called in reverse priority order.

A concrete example is the ordering dependency between ftrace and
livepatch during module load/unload; see [1] for details.

This patch replaces the singly-linked list in struct notifier_block
with a struct list_head, converting the notifier chain into a
doubly-linked list sorted in descending priority order. Based on
this, a new function notifier_call_chain_reverse() is introduced,
which traverses the chain in reverse (ascending priority order).
The corresponding blocking_notifier_call_chain_reverse() is also
added as the locking wrapper for blocking notifier chains.

The internal notifier_call_chain_robust() is updated to use
notifier_call_chain_reverse() for rollback: on error, it records
the failing notifier (last_nb) and the count of successfully called
notifiers (nr), then rolls back exactly those nr-1 notifiers in
reverse order starting from last_nb's predecessor, without needing
to know the total length of the chain.

With this change, subsystems with symmetric setup/teardown ordering
requirements can register a single notifier_block with one priority
value, and rely on blocking_notifier_call_chain() for forward
traversal and blocking_notifier_call_chain_reverse() for reverse
traversal, without needing hard-coded call sequences or separate
notifier registrations for each direction.
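
The scheme described above can be sketched in plain userspace C. This is a
minimal illustration under stated assumptions, not the patch itself: struct
nb, struct chain, the chain_*() helpers, the EV_* events and record_cb() are
hypothetical stand-ins, a nonzero return plays the role of NOTIFY_STOP_MASK,
and locking and RCU are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

enum { EV_UP = 1, EV_DOWN = 2 };

/* Hypothetical userspace stand-in for struct notifier_block. */
struct nb {
	struct nb *prev, *next;	/* doubly-linked, circular through the head */
	int priority;
	int fail_up;		/* demo knob: make the EV_UP callback fail */
	int (*call)(struct nb *nb, int event);
};

/* The chain head is a sentinel node, similar in spirit to list_head. */
struct chain {
	struct nb head;
};

static void chain_init(struct chain *c)
{
	c->head.prev = c->head.next = &c->head;
}

/* Insert in descending priority; equal priorities keep registration order. */
static void chain_register(struct chain *c, struct nb *n)
{
	struct nb *cur = c->head.next;

	assert(n->call != NULL);
	while (cur != &c->head && cur->priority >= n->priority)
		cur = cur->next;

	/* Link n in front of cur (or at the tail if cur is the head). */
	n->next = cur;
	n->prev = cur->prev;
	cur->prev->next = n;
	cur->prev = n;
}

/* Forward walk; remembers the last block called so a caller can roll back. */
static int chain_call(struct chain *c, int event, struct nb **last)
{
	int ret = 0;

	for (struct nb *nb = c->head.next; nb != &c->head; nb = nb->next) {
		ret = nb->call(nb, event);
		if (last)
			*last = nb;
		if (ret)	/* nonzero stands in for NOTIFY_STOP_MASK */
			break;
	}
	return ret;
}

/* Reverse walk from start's predecessor, or from the tail if start is NULL. */
static int chain_call_reverse(struct chain *c, struct nb *start, int event)
{
	int ret = 0;

	for (struct nb *nb = start ? start->prev : c->head.prev;
	     nb != &c->head; nb = nb->prev) {
		ret = nb->call(nb, event);
		if (ret)
			break;
	}
	return ret;
}

/* On failure, undo only the blocks that ran before the failing one. */
static int chain_call_robust(struct chain *c, int up, int down)
{
	struct nb *last = NULL;
	int ret = chain_call(c, up, &last);

	if (ret)
		chain_call_reverse(c, last, down);
	return ret;
}

/* Demo callback: logs +priority for EV_UP and -priority for EV_DOWN. */
static int call_log[16];
static int call_log_len;

static int record_cb(struct nb *nb, int event)
{
	call_log[call_log_len++] = (event == EV_UP) ? nb->priority : -nb->priority;
	return event == EV_UP && nb->fail_up;
}
```

Registering blocks with priorities 1, 3 and 2, where the priority-1 callback
fails on EV_UP, makes chain_call_robust() run EV_UP on 3, 2, 1 and then
EV_DOWN on 2 and 3. Because the rollback starts at the failing block's
predecessor and stops at the head, it needs neither the chain length nor a
call count.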

[1]: https://lore.kernel.org/all/alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/

Signed-off-by: Song Chen <chensong_2000@189.cn>
---
 drivers/acpi/sleep.c      |   1 -
 drivers/clk/clk.c         |   2 +-
 drivers/cpufreq/cpufreq.c |   2 +-
 drivers/md/dm-integrity.c |   1 -
 drivers/md/md.c           |   1 -
 include/linux/notifier.h  |  26 ++---
 kernel/debug/debug_core.c |   1 -
 kernel/notifier.c         | 219 ++++++++++++++++++++++++++++++++------
 net/ipv4/nexthop.c        |   2 +-
 9 files changed, 201 insertions(+), 54 deletions(-)

diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index 132a9df98471..b776dbd5a382 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -56,7 +56,6 @@ static int tts_notify_reboot(struct notifier_block *this,
 
 static struct notifier_block tts_notifier = {
 	.notifier_call	= tts_notify_reboot,
-	.next		= NULL,
 	.priority	= 0,
 };
 
diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index 47093cda9df3..b6fe380d0468 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -4862,7 +4862,7 @@ int clk_notifier_unregister(struct clk *clk, struct notifier_block *nb)
 			clk->core->notifier_count--;
 
 			/* XXX the notifier code should handle this better */
-			if (!cn->notifier_head.head) {
+			if (list_empty(&cn->notifier_head.head)) {
 				srcu_cleanup_notifier_head(&cn->notifier_head);
 				list_del(&cn->node);
 				kfree(cn);
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 277884d91913..12637e742ffa 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -445,7 +445,7 @@ static void cpufreq_list_transition_notifiers(void)
 
 	mutex_lock(&cpufreq_transition_notifier_list.mutex);
 
-	for (nb = cpufreq_transition_notifier_list.head; nb; nb = nb->next)
+	list_for_each_entry(nb, &cpufreq_transition_notifier_list.head, entry)
 		pr_info("%pS\n", nb->notifier_call);
 
 	mutex_unlock(&cpufreq_transition_notifier_list.mutex);
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index 06e805902151..ccdf75c40b62 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -3909,7 +3909,6 @@ static void dm_integrity_resume(struct dm_target *ti)
 	}
 
 	ic->reboot_notifier.notifier_call = dm_integrity_reboot;
-	ic->reboot_notifier.next = NULL;
 	ic->reboot_notifier.priority = INT_MAX - 1;	/* be notified after md and before hardware drivers */
 	WARN_ON(register_reboot_notifier(&ic->reboot_notifier));
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3ce6f9e9d38e..8249e78636ab 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -10480,7 +10480,6 @@ static int md_notify_reboot(struct notifier_block *this,
 
 static struct notifier_block md_notifier = {
 	.notifier_call	= md_notify_reboot,
-	.next		= NULL,
 	.priority	= INT_MAX, /* before any real devices */
 };
 
diff --git a/include/linux/notifier.h b/include/linux/notifier.h
index 01b6c9d9956f..b2abbdfcaadd 100644
--- a/include/linux/notifier.h
+++ b/include/linux/notifier.h
@@ -53,41 +53,41 @@ typedef	int (*notifier_fn_t)(struct notifier_block *nb,
 
 struct notifier_block {
 	notifier_fn_t notifier_call;
-	struct notifier_block __rcu *next;
+	struct list_head __rcu entry;
 	int priority;
 };
 
 struct atomic_notifier_head {
 	spinlock_t lock;
-	struct notifier_block __rcu *head;
+	struct list_head __rcu head;
 };
 
 struct blocking_notifier_head {
 	struct rw_semaphore rwsem;
-	struct notifier_block __rcu *head;
+	struct list_head __rcu head;
 };
 
 struct raw_notifier_head {
-	struct notifier_block __rcu *head;
+	struct list_head __rcu head;
 };
 
 struct srcu_notifier_head {
 	struct mutex mutex;
 	struct srcu_usage srcuu;
 	struct srcu_struct srcu;
-	struct notifier_block __rcu *head;
+	struct list_head __rcu head;
 };
 
 #define ATOMIC_INIT_NOTIFIER_HEAD(name) do {	\
 		spin_lock_init(&(name)->lock);	\
-		(name)->head = NULL;		\
+		INIT_LIST_HEAD(&(name)->head);		\
 	} while (0)
 #define BLOCKING_INIT_NOTIFIER_HEAD(name) do {	\
 		init_rwsem(&(name)->rwsem);	\
-		(name)->head = NULL;		\
+		INIT_LIST_HEAD(&(name)->head);		\
 	} while (0)
 #define RAW_INIT_NOTIFIER_HEAD(name) do {	\
-		(name)->head = NULL;		\
+		INIT_LIST_HEAD(&(name)->head);		\
 	} while (0)
 
 /* srcu_notifier_heads must be cleaned up dynamically */
@@ -97,17 +97,17 @@ extern void srcu_init_notifier_head(struct srcu_notifier_head *nh);
 
 #define ATOMIC_NOTIFIER_INIT(name) {				\
 		.lock = __SPIN_LOCK_UNLOCKED(name.lock),	\
-		.head = NULL }
+		.head = LIST_HEAD_INIT((name).head) }
 #define BLOCKING_NOTIFIER_INIT(name) {				\
 		.rwsem = __RWSEM_INITIALIZER((name).rwsem),	\
-		.head = NULL }
+		.head = LIST_HEAD_INIT((name).head) }
 #define RAW_NOTIFIER_INIT(name)	{				\
-		.head = NULL }
+		.head = LIST_HEAD_INIT((name).head) }
 
 #define SRCU_NOTIFIER_INIT(name, pcpu)				\
 	{							\
 		.mutex = __MUTEX_INITIALIZER(name.mutex),	\
-		.head = NULL,					\
+		.head = LIST_HEAD_INIT((name).head),					\
 		.srcuu = __SRCU_USAGE_INIT(name.srcuu),		\
 		.srcu = __SRCU_STRUCT_INIT(name.srcu, name.srcuu, pcpu, 0), \
 	}
@@ -170,6 +170,8 @@ extern int atomic_notifier_call_chain(struct atomic_notifier_head *nh,
 		unsigned long val, void *v);
 extern int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
 		unsigned long val, void *v);
+extern int blocking_notifier_call_chain_reverse(struct blocking_notifier_head *nh,
+		unsigned long val, void *v);
 extern int raw_notifier_call_chain(struct raw_notifier_head *nh,
 		unsigned long val, void *v);
 extern int srcu_notifier_call_chain(struct srcu_notifier_head *nh,
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index 0b9495187fba..a26a7683d142 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -1054,7 +1054,6 @@ dbg_notify_reboot(struct notifier_block *this, unsigned long code, void *x)
 
 static struct notifier_block dbg_reboot_notifier = {
 	.notifier_call		= dbg_notify_reboot,
-	.next			= NULL,
 	.priority		= INT_MAX,
 };
 
diff --git a/kernel/notifier.c b/kernel/notifier.c
index 2f9fe7c30287..6f4d887771c4 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -14,39 +14,47 @@
  *	are layered on top of these, with appropriate locking added.
  */
 
-static int notifier_chain_register(struct notifier_block **nl,
+static int notifier_chain_register(struct list_head *nl,
 				   struct notifier_block *n,
 				   bool unique_priority)
 {
-	while ((*nl) != NULL) {
-		if (unlikely((*nl) == n)) {
+	struct notifier_block *cur;
+
+	list_for_each_entry(cur, nl, entry) {
+		if (unlikely(cur == n)) {
 			WARN(1, "notifier callback %ps already registered",
 			     n->notifier_call);
 			return -EEXIST;
 		}
-		if (n->priority > (*nl)->priority)
-			break;
-		if (n->priority == (*nl)->priority && unique_priority)
+
+		if (n->priority == cur->priority && unique_priority)
 			return -EBUSY;
-		nl = &((*nl)->next);
+
+		if (n->priority > cur->priority) {
+			list_add_tail(&n->entry, &cur->entry);
+			goto out;
+		}
 	}
-	n->next = *nl;
-	rcu_assign_pointer(*nl, n);
+
+	list_add_tail(&n->entry, nl);
+out:
 	trace_notifier_register((void *)n->notifier_call);
 	return 0;
 }
 
-static int notifier_chain_unregister(struct notifier_block **nl,
+static int notifier_chain_unregister(struct list_head *nl,
 		struct notifier_block *n)
 {
-	while ((*nl) != NULL) {
-		if ((*nl) == n) {
-			rcu_assign_pointer(*nl, n->next);
+	struct notifier_block *cur;
+
+	list_for_each_entry(cur, nl, entry) {
+		if (cur == n) {
+			list_del(&n->entry);
 			trace_notifier_unregister((void *)n->notifier_call);
 			return 0;
 		}
-		nl = &((*nl)->next);
 	}
+
 	return -ENOENT;
 }
 
@@ -59,25 +67,25 @@ static int notifier_chain_unregister(struct notifier_block **nl,
  *			value of this parameter is -1.
  *	@nr_calls:	Records the number of notifications sent. Don't care
  *			value of this field is NULL.
+ *	@last_nb:  Records the last called notifier block for rolling back
  *	Return:		notifier_call_chain returns the value returned by the
  *			last notifier function called.
  */
-static int notifier_call_chain(struct notifier_block **nl,
+static int notifier_call_chain(struct list_head *nl,
 			       unsigned long val, void *v,
-			       int nr_to_call, int *nr_calls)
+			       int nr_to_call, int *nr_calls,
+				   struct notifier_block **last_nb)
 {
 	int ret = NOTIFY_DONE;
-	struct notifier_block *nb, *next_nb;
-
-	nb = rcu_dereference_raw(*nl);
+	struct notifier_block *nb;
 
-	while (nb && nr_to_call) {
-		next_nb = rcu_dereference_raw(nb->next);
+	if (!nr_to_call)
+		return ret;
 
+	list_for_each_entry(nb, nl, entry) {
 #ifdef CONFIG_DEBUG_NOTIFIERS
 		if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
 			WARN(1, "Invalid notifier called!");
-			nb = next_nb;
 			continue;
 		}
 #endif
@@ -87,15 +95,118 @@ static int notifier_call_chain(struct notifier_block **nl,
 		if (nr_calls)
 			(*nr_calls)++;
 
+		if (last_nb)
+			*last_nb = nb;
+
 		if (ret & NOTIFY_STOP_MASK)
 			break;
-		nb = next_nb;
-		nr_to_call--;
+
+		if (--nr_to_call == 0)
+			break;
 	}
 	return ret;
 }
 NOKPROBE_SYMBOL(notifier_call_chain);
 
+/**
+ * notifier_call_chain_reverse - Informs the registered notifiers
+ *			about an event reversely.
+ *	@nl:		Pointer to head of the blocking notifier chain
+ *	@val:		Value passed unmodified to notifier function
+ *	@v:		Pointer passed unmodified to notifier function
+ *	@nr_to_call:	Number of notifier functions to be called. Don't care
+ *			value of this parameter is -1.
+ *	@nr_calls:	Records the number of notifications sent. Don't care
+ *			value of this field is NULL.
+ *	Return:		notifier_call_chain returns the value returned by the
+ *			last notifier function called.
+ */
+static int notifier_call_chain_reverse(struct list_head *nl,
+					struct notifier_block *start,
+					unsigned long val, void *v,
+					int nr_to_call, int *nr_calls)
+{
+	int ret = NOTIFY_DONE;
+	struct notifier_block *nb;
+	bool do_call = (start == NULL);
+
+	if (!nr_to_call)
+		return ret;
+
+	list_for_each_entry_reverse(nb, nl, entry) {
+		if (!do_call) {
+			if (nb == start)
+				do_call = true;
+			continue;
+		}
+#ifdef CONFIG_DEBUG_NOTIFIERS
+		if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
+			WARN(1, "Invalid notifier called!");
+			continue;
+		}
+#endif
+		trace_notifier_run((void *)nb->notifier_call);
+		ret = nb->notifier_call(nb, val, v);
+
+		if (nr_calls)
+			(*nr_calls)++;
+
+		if (ret & NOTIFY_STOP_MASK)
+			break;
+
+		if (--nr_to_call == 0)
+			break;
+	}
+	return ret;
+}
+NOKPROBE_SYMBOL(notifier_call_chain_reverse);
+
+/**
+ * notifier_call_chain_rcu - Informs the registered notifiers
+ *			about an event for srcu notifier chain.
+ *	@nl:		Pointer to head of the blocking notifier chain
+ *	@val:		Value passed unmodified to notifier function
+ *	@v:		Pointer passed unmodified to notifier function
+ *	@nr_to_call:	Number of notifier functions to be called. Don't care
+ *			value of this parameter is -1.
+ *	@nr_calls:	Records the number of notifications sent. Don't care
+ *			value of this field is NULL.
+ *	Return:		notifier_call_chain returns the value returned by the
+ *			last notifier function called.
+ */
+static int notifier_call_chain_rcu(struct list_head *nl,
+			       unsigned long val, void *v,
+			       int nr_to_call, int *nr_calls)
+{
+	int ret = NOTIFY_DONE;
+	struct notifier_block *nb;
+
+	if (!nr_to_call)
+		return ret;
+
+	list_for_each_entry_rcu(nb, nl, entry) {
+#ifdef CONFIG_DEBUG_NOTIFIERS
+		if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
+			WARN(1, "Invalid notifier called!");
+			continue;
+		}
+#endif
+		trace_notifier_run((void *)nb->notifier_call);
+		ret = nb->notifier_call(nb, val, v);
+
+		if (nr_calls)
+			(*nr_calls)++;
+
+		if (ret & NOTIFY_STOP_MASK)
+			break;
+
+		if (--nr_to_call == 0)
+			break;
+	}
+	return ret;
+}
+NOKPROBE_SYMBOL(notifier_call_chain_rcu);
+
 /**
  * notifier_call_chain_robust - Inform the registered notifiers about an event
  *                              and rollback on error.
@@ -111,15 +222,16 @@ NOKPROBE_SYMBOL(notifier_call_chain);
  *
  * Return:	the return value of the @val_up call.
  */
-static int notifier_call_chain_robust(struct notifier_block **nl,
+static int notifier_call_chain_robust(struct list_head *nl,
 				     unsigned long val_up, unsigned long val_down,
 				     void *v)
 {
 	int ret, nr = 0;
+	struct notifier_block *last_nb = NULL;
 
-	ret = notifier_call_chain(nl, val_up, v, -1, &nr);
+	ret = notifier_call_chain(nl, val_up, v, -1, &nr, &last_nb);
 	if (ret & NOTIFY_STOP_MASK)
-		notifier_call_chain(nl, val_down, v, nr-1, NULL);
+		notifier_call_chain_reverse(nl, last_nb, val_down, v, nr-1, NULL);
 
 	return ret;
 }
@@ -220,7 +332,7 @@ int atomic_notifier_call_chain(struct atomic_notifier_head *nh,
 	int ret;
 
 	rcu_read_lock();
-	ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
+	ret = notifier_call_chain(&nh->head, val, v, -1, NULL, NULL);
 	rcu_read_unlock();
 
 	return ret;
@@ -238,7 +350,7 @@ NOKPROBE_SYMBOL(atomic_notifier_call_chain);
  */
 bool atomic_notifier_call_chain_is_empty(struct atomic_notifier_head *nh)
 {
-	return !rcu_access_pointer(nh->head);
+	return list_empty(&nh->head);
 }
 
 /*
@@ -340,7 +452,7 @@ int blocking_notifier_call_chain_robust(struct blocking_notifier_head *nh,
 	 * racy then it does not matter what the result of the test
 	 * is, we re-check the list after having taken the lock anyway:
 	 */
-	if (rcu_access_pointer(nh->head)) {
+	if (!list_empty(&nh->head)) {
 		down_read(&nh->rwsem);
 		ret = notifier_call_chain_robust(&nh->head, val_up, val_down, v);
 		up_read(&nh->rwsem);
@@ -375,15 +487,52 @@ int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
 	 * racy then it does not matter what the result of the test
 	 * is, we re-check the list after having taken the lock anyway:
 	 */
-	if (rcu_access_pointer(nh->head)) {
+	if (!list_empty(&nh->head)) {
 		down_read(&nh->rwsem);
-		ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
+		ret = notifier_call_chain(&nh->head, val, v, -1, NULL, NULL);
 		up_read(&nh->rwsem);
 	}
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blocking_notifier_call_chain);
 
+/**
+ *	blocking_notifier_call_chain_reverse - Call functions in a
+ *				blocking notifier chain in reverse order
+ *	@nh: Pointer to head of the blocking notifier chain
+ *	@val: Value passed unmodified to notifier function
+ *	@v: Pointer passed unmodified to notifier function
+ *
+ *	Calls each function in a notifier chain in reverse order.  The
+ *	functions run in a process context, so they are allowed to block.
+ *
+ *	If the return value of the notifier can be and'ed
+ *	with %NOTIFY_STOP_MASK then blocking_notifier_call_chain_reverse()
+ *	will return immediately, with the return value of
+ *	the notifier function which halted execution.
+ *	Otherwise the return value is the return value
+ *	of the last notifier function called.
+ */
+
+int blocking_notifier_call_chain_reverse(struct blocking_notifier_head *nh,
+		unsigned long val, void *v)
+{
+	int ret = NOTIFY_DONE;
+
+	/*
+	 * We check the head outside the lock, but if this access is
+	 * racy then it does not matter what the result of the test
+	 * is, we re-check the list after having taken the lock anyway:
+	 */
+	if (!list_empty(&nh->head)) {
+		down_read(&nh->rwsem);
+		ret = notifier_call_chain_reverse(&nh->head, NULL, val, v, -1, NULL);
+		up_read(&nh->rwsem);
+	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blocking_notifier_call_chain_reverse);
+
 /*
  *	Raw notifier chain routines.  There is no protection;
  *	the caller must provide it.  Use at your own risk!
@@ -450,7 +599,7 @@ EXPORT_SYMBOL_GPL(raw_notifier_call_chain_robust);
 int raw_notifier_call_chain(struct raw_notifier_head *nh,
 		unsigned long val, void *v)
 {
-	return notifier_call_chain(&nh->head, val, v, -1, NULL);
+	return notifier_call_chain(&nh->head, val, v, -1, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(raw_notifier_call_chain);
 
@@ -543,7 +692,7 @@ int srcu_notifier_call_chain(struct srcu_notifier_head *nh,
 	int idx;
 
 	idx = srcu_read_lock(&nh->srcu);
-	ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
+	ret = notifier_call_chain_rcu(&nh->head, val, v, -1, NULL);
 	srcu_read_unlock(&nh->srcu, idx);
 	return ret;
 }
@@ -566,7 +715,7 @@ void srcu_init_notifier_head(struct srcu_notifier_head *nh)
 	mutex_init(&nh->mutex);
 	if (init_srcu_struct(&nh->srcu) < 0)
 		BUG();
-	nh->head = NULL;
+	INIT_LIST_HEAD(&nh->head);
 }
 EXPORT_SYMBOL_GPL(srcu_init_notifier_head);
 
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index c942f1282236..0afcba2967c7 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -90,7 +90,7 @@ static const struct nla_policy rtm_nh_res_bucket_policy_get[] = {
 
 static bool nexthop_notifiers_is_empty(struct net *net)
 {
-	return !net->nexthop.notifier_chain.head;
+	return list_empty(&net->nexthop.notifier_chain.head);
 }
 
 static void
-- 
2.43.0



* Re: [PATCH v2] pmdomain: imx: Make IMX8M/IMX9 BLK_CTRL tristate
From: Frank Li @ 2026-04-15  6:58 UTC (permalink / raw)
  To: Zhipeng Wang
  Cc: ulfh, s.hauer, kernel, festevam, linux-pm, imx, linux-arm-kernel,
	linux-kernel, xuegang.liu, jindong.yue
In-Reply-To: <20260413053049.3041177-1-zhipeng.wang_1@nxp.com>

On Mon, Apr 13, 2026 at 02:30:49PM +0900, Zhipeng Wang wrote:
> Convert IMX8M_BLK_CTRL and IMX9_BLK_CTRL from bool to tristate
> to allow building as loadable modules.
>
> Add prompt strings to make these options visible and configurable
> in menuconfig, keeping them enabled by default on appropriate platforms.
>
> Also remove the IMX_GPCV2_PM_DOMAINS dependency from IMX9_BLK_CTRL.
> This dependency was incorrect from the beginning - i.MX93 uses a

s/-/because

Reviewed-by: Frank Li <Frank.Li@nxp.com>

> different power domain architecture compared to i.MX8M series:
>
> - i.MX8M uses GPCv2 (General Power Controller v2) for power domain
>   management, hence IMX8M_BLK_CTRL correctly depends on it.
>
> - i.MX93 uses BLK_CTRL directly without GPCv2. The hardware doesn't
>   have GPCv2 at all.
>
> Signed-off-by: Zhipeng Wang <zhipeng.wang_1@nxp.com>
> ---
>  drivers/pmdomain/imx/Kconfig | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pmdomain/imx/Kconfig b/drivers/pmdomain/imx/Kconfig
> index 00203615c65e..9168d183b0c5 100644
> --- a/drivers/pmdomain/imx/Kconfig
> +++ b/drivers/pmdomain/imx/Kconfig
> @@ -10,15 +10,18 @@ config IMX_GPCV2_PM_DOMAINS
>  	default y if SOC_IMX7D
>
>  config IMX8M_BLK_CTRL
> -	bool
> -	default SOC_IMX8M && IMX_GPCV2_PM_DOMAINS
> +	tristate "i.MX8M BLK CTRL driver"
> +	depends on SOC_IMX8M
> +	depends on IMX_GPCV2_PM_DOMAINS
>  	depends on PM_GENERIC_DOMAINS
>  	depends on COMMON_CLK
> +	default y
>
>  config IMX9_BLK_CTRL
> -	bool
> -	default SOC_IMX9 && IMX_GPCV2_PM_DOMAINS
> +	tristate "i.MX93 BLK CTRL driver"
> +	depends on SOC_IMX9
>  	depends on PM_GENERIC_DOMAINS
> +	default y
>
>  config IMX_SCU_PD
>  	bool "IMX SCU Power Domain driver"
> --
> 2.34.1
>


* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Song Chen @ 2026-04-15  6:43 UTC (permalink / raw)
  To: Petr Pavlu
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe, jikos, mbenes,
	pmladek, joe.lawrence, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <1191caf5-6a61-4622-a15e-854d3701f4fc@suse.com>

Hi,

On 4/14/26 22:33, Petr Pavlu wrote:
> On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
>> From: Song Chen <chensong_2000@189.cn>
>>
>> ftrace and livepatch currently have their module load/unload callbacks
>> hard-coded in the module loader as direct function calls to
>> ftrace_module_enable(), klp_module_coming(), klp_module_going()
>> and ftrace_release_mod(). This tight coupling was originally introduced
>> to enforce strict call ordering that could not be guaranteed by the
>> module notifier chain, which only supported forward traversal. Their
>> notifiers were moved in and out back and forth. see [1] and [2].
> 
> I'm unclear about what is meant by the notifiers being moved back and
> forth. The links point to patches that converted ftrace+klp from using
> module notifiers to explicit callbacks due to ordering issues, but this
> switch occurred only once. Have there been other attempts to use
> notifiers again?
> 

Yes, only once. I will rephrase.

>>
>> Now that the notifier chain supports reverse traversal via
>> blocking_notifier_call_chain_reverse(), the ordering can be enforced
>> purely through notifier priority. As a result, the module loader is now
>> decoupled from the implementation details of ftrace and livepatch.
>> What's more, adding a new subsystem with symmetric setup/teardown ordering
>> requirements during module load/unload no longer requires modifying
>> kernel/module/main.c; it only needs to register a notifier_block with an
>> appropriate priority.
>>
>> [1]:https://lore.kernel.org/all/
>> 	alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/
>> [2]:https://lore.kernel.org/all/
>> 	20160301030034.GC12120@packer-debian-8-amd64.digitalocean.com/
> 
> Nit: Avoid wrapping URLs, as it breaks autolinking and makes the links
> harder to copy.
> 
> Better links would be:
> [1] https://lore.kernel.org/all/1455661953-15838-1-git-send-email-jeyu@redhat.com/
> [2] https://lore.kernel.org/all/1458176139-17455-1-git-send-email-jeyu@redhat.com/
> 
> The first link is the final version of what landed as commit
> 7dcd182bec27 ("ftrace/module: remove ftrace module notifier"). The
> second is commit 7e545d6eca20 ("livepatch/module: remove livepatch
> module notifier").
> 

Thank you, I will update.

>>
>> Signed-off-by: Song Chen <chensong_2000@189.cn>
>> ---
>>   include/linux/module.h  |  8 ++++++++
>>   kernel/livepatch/core.c | 29 ++++++++++++++++++++++++++++-
>>   kernel/module/main.c    | 34 +++++++++++++++-------------------
>>   kernel/trace/ftrace.c   | 38 ++++++++++++++++++++++++++++++++++++++
>>   4 files changed, 89 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/linux/module.h b/include/linux/module.h
>> index 14f391b186c6..0bdd56f9defd 100644
>> --- a/include/linux/module.h
>> +++ b/include/linux/module.h
>> @@ -308,6 +308,14 @@ enum module_state {
>>   	MODULE_STATE_COMING,	/* Full formed, running module_init. */
>>   	MODULE_STATE_GOING,	/* Going away. */
>>   	MODULE_STATE_UNFORMED,	/* Still setting it up. */
>> +	MODULE_STATE_FORMED,
> 
> I don't see a reason to add a new module state. Why is it necessary and
> how does it fit with the existing states?
> 
Because if a notifier fails in state MODULE_STATE_UNFORMED (currently
only ftrace has something to do in this state), the notifier chain will
roll back via blocking_notifier_call_chain_robust. I'm afraid that using
MODULE_STATE_GOING for that rollback would jeopardise notifiers that
don't handle it appropriately, for example:

case MODULE_STATE_COMING:
      kmalloc();
case MODULE_STATE_GOING:
      kfree();
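
The concern above can be illustrated with a minimal userspace sketch. All
names (demo_state, demo_ctx, demo_notifier) are hypothetical and only
stand in for a notifier that pairs COMING with GOING; malloc/free replace
kmalloc/kfree. It is not taken from any existing kernel notifier.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the module states discussed in this thread. */
enum demo_state { DEMO_UNFORMED, DEMO_COMING, DEMO_GOING, DEMO_FORMED };

/* Per-notifier private data; buf stays NULL until COMING allocates it. */
struct demo_ctx {
	void *buf;
	int going_without_coming;	/* counts unbalanced GOING events */
};

/* A notifier that assumes every GOING was preceded by a COMING. */
static void demo_notifier(struct demo_ctx *ctx, enum demo_state op)
{
	switch (op) {
	case DEMO_COMING:
		ctx->buf = malloc(16);
		break;
	case DEMO_GOING:
		if (!ctx->buf)
			ctx->going_without_coming++;	/* teardown without setup */
		free(ctx->buf);
		ctx->buf = NULL;
		break;
	default:
		/* UNFORMED and FORMED are ignored by this notifier. */
		break;
	}
}
```

Rolling back an UNFORMED failure with DEMO_GOING reaches the teardown path
without a prior setup, while rolling back with a distinct state such as
DEMO_FORMED is ignored by notifiers that do not know about it.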


>> +};
>> +
>> +enum module_notifier_prio {
>> +	MODULE_NOTIFIER_PRIO_LOW = INT_MIN,	/* Low priority, coming last, going first. */
>> +	MODULE_NOTIFIER_PRIO_MID = 0,	/* Normal priority. */
>> +	MODULE_NOTIFIER_PRIO_SECOND_HIGH = INT_MAX - 1,	/* Second high priority, coming second. */
>> +	MODULE_NOTIFIER_PRIO_HIGH = INT_MAX,	/* High priority, coming first, going late. */
> 
> I suggest being explicit about how the notifiers are ordered. For
> example:
> 
> enum module_notifier_prio {
> 	MODULE_NOTIFIER_PRIO_NORMAL,	/* Normal priority, coming last, going first. */
> 	MODULE_NOTIFIER_PRIO_LIVEPATCH,
> 	MODULE_NOTIFIER_PRIO_FTRACE,	/* High priority, coming first, going late. */
> };
> 

accepted.

>>   };
>>   
>>   struct mod_tree_node {
>> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
>> index 28d15ba58a26..ce78bb23e24b 100644
>> --- a/kernel/livepatch/core.c
>> +++ b/kernel/livepatch/core.c
>> @@ -1375,13 +1375,40 @@ void *klp_find_section_by_name(const struct module *mod, const char *name,
>>   }
>>   EXPORT_SYMBOL_GPL(klp_find_section_by_name);
>>   
>> +static int klp_module_callback(struct notifier_block *nb, unsigned long op,
>> +			void *module)
>> +{
>> +	struct module *mod = module;
>> +	int err = 0;
>> +
>> +	switch (op) {
>> +	case MODULE_STATE_COMING:
>> +		err = klp_module_coming(mod);
>> +		break;
>> +	case MODULE_STATE_LIVE:
>> +		break;
>> +	case MODULE_STATE_GOING:
>> +		klp_module_going(mod);
>> +		break;
>> +	default:
>> +		break;
>> +	}
> 
> klp_module_coming() and klp_module_going() are now used only in
> kernel/livepatch/core.c where they are also defined. This means the
> functions can be static and their declarations removed from
> include/linux/livepatch.h.
> 
> Nit: The MODULE_STATE_LIVE and default cases in the switch can be
> removed.
> 

accepted.

>> +
>> +	return notifier_from_errno(err);
>> +}
>> +
>> +static struct notifier_block klp_module_nb = {
>> +	.notifier_call = klp_module_callback,
>> +	.priority = MODULE_NOTIFIER_PRIO_SECOND_HIGH
>> +};
>> +
>>   static int __init klp_init(void)
>>   {
>>   	klp_root_kobj = kobject_create_and_add("livepatch", kernel_kobj);
>>   	if (!klp_root_kobj)
>>   		return -ENOMEM;
>>   
>> -	return 0;
>> +	return register_module_notifier(&klp_module_nb);
>>   }
>>   
>>   module_init(klp_init);
>> diff --git a/kernel/module/main.c b/kernel/module/main.c
>> index c3ce106c70af..226dd5b80997 100644
>> --- a/kernel/module/main.c
>> +++ b/kernel/module/main.c
>> @@ -833,10 +833,8 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
>>   	/* Final destruction now no one is using it. */
>>   	if (mod->exit != NULL)
>>   		mod->exit();
>> -	blocking_notifier_call_chain(&module_notify_list,
>> +	blocking_notifier_call_chain_reverse(&module_notify_list,
>>   				     MODULE_STATE_GOING, mod);
>> -	klp_module_going(mod);
>> -	ftrace_release_mod(mod);
>>   
>>   	async_synchronize_full();
>>   
>> @@ -3135,10 +3133,8 @@ static noinline int do_init_module(struct module *mod)
>>   	mod->state = MODULE_STATE_GOING;
>>   	synchronize_rcu();
>>   	module_put(mod);
>> -	blocking_notifier_call_chain(&module_notify_list,
>> +	blocking_notifier_call_chain_reverse(&module_notify_list,
>>   				     MODULE_STATE_GOING, mod);
>> -	klp_module_going(mod);
>> -	ftrace_release_mod(mod);
>>   	free_module(mod);
>>   	wake_up_all(&module_wq);
>>   
> 
> The patch unexpectedly leaves a call to ftrace_free_mem() in
> do_init_module().

Thanks for pointing it out. I removed it while implementing and testing,
but it was left in by mistake when I organized the patch. I will remove it.

> 
>> @@ -3281,20 +3277,14 @@ static int complete_formation(struct module *mod, struct load_info *info)
>>   	return err;
>>   }
>>   
>> -static int prepare_coming_module(struct module *mod)
>> +static int prepare_module_state_transaction(struct module *mod,
>> +			unsigned long val_up, unsigned long val_down)
>>   {
>>   	int err;
>>   
>> -	ftrace_module_enable(mod);
>> -	err = klp_module_coming(mod);
>> -	if (err)
>> -		return err;
>> -
>>   	err = blocking_notifier_call_chain_robust(&module_notify_list,
>> -			MODULE_STATE_COMING, MODULE_STATE_GOING, mod);
>> +			val_up, val_down, mod);
>>   	err = notifier_to_errno(err);
>> -	if (err)
>> -		klp_module_going(mod);
>>   
>>   	return err;
>>   }
>> @@ -3468,14 +3458,21 @@ static int load_module(struct load_info *info, const char __user *uargs,
>>   	init_build_id(mod, info);
>>   
>>   	/* Ftrace init must be called in the MODULE_STATE_UNFORMED state */
>> -	ftrace_module_init(mod);
>> +	err = prepare_module_state_transaction(mod,
>> +				MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
> 
> I believe val_down should be MODULE_STATE_GOING to reverse the
> operation. Why is the new state MODULE_STATE_FORMED needed here?
To avoid breaking notifiers that pair the two states like this:

case MODULE_STATE_COMING:
      kmalloc();
case MODULE_STATE_GOING:
      kfree();



> 
>> +	if (err)
>> +		goto ddebug_cleanup;
>>   
>>   	/* Finally it's fully formed, ready to start executing. */
>>   	err = complete_formation(mod, info);
>> -	if (err)
>> +	if (err) {
>> +		blocking_notifier_call_chain_reverse(&module_notify_list,
>> +				MODULE_STATE_FORMED, mod);
>>   		goto ddebug_cleanup;
>> +	}
>>   
>> -	err = prepare_coming_module(mod);
>> +	err = prepare_module_state_transaction(mod,
>> +				MODULE_STATE_COMING, MODULE_STATE_GOING);
>>   	if (err)
>>   		goto bug_cleanup;
>>   
>> @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
>>   	destroy_params(mod->kp, mod->num_kp);
>>   	blocking_notifier_call_chain(&module_notify_list,
>>   				     MODULE_STATE_GOING, mod);
> 
> My understanding is that all notifier chains for MODULE_STATE_GOING
> should be reversed.
Yes, all of them, from the lowest-priority notifier to the highest.
I will resend patch 1, which failed to go out because of my proxy settings.
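
As a rough illustration of that ordering, here is a userspace sketch with
hypothetical names (demo_chain, demo_register, demo_nth_coming,
demo_nth_going). It assumes the chain is kept sorted by descending
priority at registration time, so "coming" walks it forwards and "going"
walks it backwards. It does not use the kernel's list or notifier APIs.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_NB 8

/* Hypothetical notifier entry: a name and a priority. */
struct demo_prio_nb {
	const char *name;
	int priority;
};

struct demo_chain {
	struct demo_prio_nb nb[DEMO_MAX_NB];
	int count;
};

/* Insert so that the array stays sorted by descending priority. */
static int demo_register(struct demo_chain *c, const char *name, int priority)
{
	int pos = 0;

	if (c->count >= DEMO_MAX_NB)
		return -1;
	/* Skip entries of higher or equal priority (keeps registration order). */
	while (pos < c->count && c->nb[pos].priority >= priority)
		pos++;
	memmove(&c->nb[pos + 1], &c->nb[pos],
		(size_t)(c->count - pos) * sizeof(c->nb[0]));
	c->nb[pos].name = name;
	c->nb[pos].priority = priority;
	c->count++;
	return 0;
}

/* "Coming" order: highest priority first. */
static const char *demo_nth_coming(const struct demo_chain *c, int n)
{
	return c->nb[n].name;
}

/* "Going" order: the same chain walked backwards, lowest priority first. */
static const char *demo_nth_going(const struct demo_chain *c, int n)
{
	return c->nb[c->count - 1 - n].name;
}
```

With entries registered as "ftrace" (highest), "livepatch" and an ordinary
notifier, the coming order starts with "ftrace" and the going order ends
with it, which is the symmetric setup/teardown the patch description asks
for.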

> 
>> -	klp_module_going(mod);
>>    bug_cleanup:
>>   	mod->state = MODULE_STATE_GOING;
>>   	/* module_bug_cleanup needs module_mutex protection */
> 
> The patch removes the klp_module_going() cleanup call in load_module().
> Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
> should be removed and appropriately replaced with a cleanup via
> a notifier.
> 
     err = prepare_module_state_transaction(mod,
                 MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
     if (err)
         goto ddebug_cleanup;

ftrace will be cleaned up by the rollback in blocking_notifier_call_chain_robust.

     err = prepare_module_state_transaction(mod,
                 MODULE_STATE_COMING, MODULE_STATE_GOING);

Each notifier, including ftrace and klp, will be cleaned up by the
rollback in blocking_notifier_call_chain_robust.

If all notifiers succeed in MODULE_STATE_COMING, they will all be
cleaned up in
  coming_cleanup:
     mod->state = MODULE_STATE_GOING;
     destroy_params(mod->kp, mod->num_kp);
     blocking_notifier_call_chain(&module_notify_list,
                      MODULE_STATE_GOING, mod);

if something goes wrong afterwards.
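
The rollback behaviour described here can be sketched in userspace as
follows. The names (demo_robust_nb, demo_call_chain_robust) are
hypothetical, and the array index stands in for chain position; the
sketch only assumes what the thread states, namely that on failure the
"down" event goes, in reverse order, to the notifiers that already
handled the "up" event, and not to the one that failed.

```c
#include <assert.h>

/* Hypothetical notifier with counters so the call pattern can be checked. */
struct demo_robust_nb {
	int fail_on_up;		/* nonzero: this notifier rejects the "up" event */
	int ups;		/* successful "up" events handled */
	int downs;		/* "down" (rollback) events received */
	int down_order;		/* 1-based position in the rollback sequence */
};

/*
 * Deliver "up" in chain order. On the first failure, deliver "down" in
 * reverse order to the notifiers that already succeeded, skipping the
 * failing one. Returns the index of the failing notifier, or -1 if all
 * notifiers accepted the event.
 */
static int demo_call_chain_robust(struct demo_robust_nb *chain, int n)
{
	int seq = 0;

	for (int i = 0; i < n; i++) {
		if (chain[i].fail_on_up) {
			for (int j = i - 1; j >= 0; j--) {
				chain[j].downs++;
				chain[j].down_order = ++seq;
			}
			return i;
		}
		chain[i].ups++;
	}
	return -1;
}
```

If every notifier succeeds, nothing is rolled back here, and the later
explicit GOING notification is what tears everything down.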

>> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
>> index 8df69e702706..efedb98d3db4 100644
>> --- a/kernel/trace/ftrace.c
>> +++ b/kernel/trace/ftrace.c
>> @@ -5241,6 +5241,44 @@ static int __init ftrace_mod_cmd_init(void)
>>   }
>>   core_initcall(ftrace_mod_cmd_init);
>>   
>> +static int ftrace_module_callback(struct notifier_block *nb, unsigned long op,
>> +			void *module)
>> +{
>> +	struct module *mod = module;
>> +
>> +	switch (op) {
>> +	case MODULE_STATE_UNFORMED:
>> +		ftrace_module_init(mod);
>> +		break;
>> +	case MODULE_STATE_COMING:
>> +		ftrace_module_enable(mod);
>> +		break;
>> +	case MODULE_STATE_LIVE:
>> +		ftrace_free_mem(mod, mod->mem[MOD_INIT_TEXT].base,
>> +				mod->mem[MOD_INIT_TEXT].base + mod->mem[MOD_INIT_TEXT].size);
>> +		break;
>> +	case MODULE_STATE_GOING:
>> +	case MODULE_STATE_FORMED:
>> +		ftrace_release_mod(mod);
>> +		break;
>> +	default:
>> +		break;
>> +	}
> 
> ftrace_module_init(), ftrace_module_enable(), ftrace_free_mem() and
> ftrace_release_mod() should be newly used only in kernel/trace/ftrace.c
> where they are also defined. The functions can then be made static and
> removed from include/linux/ftrace.h.
> 
> Nit: The default case in the switch can be removed.
> 

accepted.

>> +
>> +	return notifier_from_errno(0);
> 
> Nit: This can be simply "return NOTIFY_OK;".

accepted
> 
>> +}
>> +
>> +static struct notifier_block ftrace_module_nb = {
>> +	.notifier_call = ftrace_module_callback,
>> +	.priority = MODULE_NOTIFIER_PRIO_HIGH
>> +};
>> +
>> +static int __init ftrace_register_module_notifier(void)
>> +{
>> +	return register_module_notifier(&ftrace_module_nb);
>> +}
>> +core_initcall(ftrace_register_module_notifier);
>> +
>>   static void function_trace_probe_call(unsigned long ip, unsigned long parent_ip,
>>   				      struct ftrace_ops *op, struct ftrace_regs *fregs)
>>   {
> 

Best regards

Song



* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Hanabishi @ 2026-04-14 21:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Eric Naim, LKML, Calvin Owens,
	Peter Zijlstra, Anna-Maria Behnsen, Ingo Molnar, John Stultz,
	Stephen Boyd, Alexander Viro, Christian Brauner, Jan Kara,
	linux-fsdevel, Sebastian Reichel, linux-pm, Pablo Neira Ayuso,
	Florian Westphal, Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <87340xfeje.ffs@tglx>

On 14/04/2026 20:55, Thomas Gleixner wrote:
> The one below should cover all possible holes.
> 
> Thanks,
> 
>          tglx
> ---
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index b4d730604972..5e22697b098d 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -94,6 +94,9 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
>   	if (dev->features & CLOCK_EVT_FEAT_DUMMY)
>   		return 0;
>   
> +	/* On state transitions clear the forced flag unconditionally */
> +	dev->next_event_forced = 0;
> +
>   	/* Transition with new state-specific callbacks */
>   	switch (state) {
>   	case CLOCK_EVT_STATE_DETACHED:
> @@ -366,8 +369,10 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, b
>   	if (delta > (int64_t)dev->min_delta_ns) {
>   		delta = min(delta, (int64_t) dev->max_delta_ns);
>   		cycles = ((u64)delta * dev->mult) >> dev->shift;
> -		if (!dev->set_next_event((unsigned long) cycles, dev))
> +		if (!dev->set_next_event((unsigned long) cycles, dev)) {
> +			dev->next_event_forced = 0;
>   			return 0;
> +		}
>   	}
>   
>   	if (dev->next_event_forced)
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index 7e57fa31ee26..115e0bf01276 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -108,6 +108,7 @@ static struct clock_event_device *tick_get_oneshot_wakeup_device(int cpu)
>   
>   static void tick_oneshot_wakeup_handler(struct clock_event_device *wd)
>   {
> +	wd->next_event_forced = 0;
>   	/*
>   	 * If we woke up early and the tick was reprogrammed in the
>   	 * meantime then this may be spurious but harmless.

Ok, it does fix the problem! Thank you.
The patch itself does not apply cleanly for 7.0 though and I had to adapt it a bit.



* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Thomas Gleixner @ 2026-04-14 20:55 UTC (permalink / raw)
  To: Hanabishi, Frederic Weisbecker
  Cc: Eric Naim, LKML, Calvin Owens, Peter Zijlstra, Anna-Maria Behnsen,
	Ingo Molnar, John Stultz, Stephen Boyd, Alexander Viro,
	Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
	linux-pm, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	netfilter-devel, coreteam
In-Reply-To: <a3ac856c-914c-4b39-949f-634bed501e7c@gmail.com>

On Tue, Apr 14 2026 at 18:25, Hanabishi wrote:
> On 14/04/2026 18:04, Frederic Weisbecker wrote:
>
> This patch doesn't help me unfortunately. Thanks.

The one below should cover all possible holes.

Thanks,

        tglx
---
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index b4d730604972..5e22697b098d 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -94,6 +94,9 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
 	if (dev->features & CLOCK_EVT_FEAT_DUMMY)
 		return 0;
 
+	/* On state transitions clear the forced flag unconditionally */
+	dev->next_event_forced = 0;
+
 	/* Transition with new state-specific callbacks */
 	switch (state) {
 	case CLOCK_EVT_STATE_DETACHED:
@@ -366,8 +369,10 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, b
 	if (delta > (int64_t)dev->min_delta_ns) {
 		delta = min(delta, (int64_t) dev->max_delta_ns);
 		cycles = ((u64)delta * dev->mult) >> dev->shift;
-		if (!dev->set_next_event((unsigned long) cycles, dev))
+		if (!dev->set_next_event((unsigned long) cycles, dev)) {
+			dev->next_event_forced = 0;
 			return 0;
+		}
 	}
 
 	if (dev->next_event_forced)
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 7e57fa31ee26..115e0bf01276 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -108,6 +108,7 @@ static struct clock_event_device *tick_get_oneshot_wakeup_device(int cpu)
 
 static void tick_oneshot_wakeup_handler(struct clock_event_device *wd)
 {
+	wd->next_event_forced = 0;
 	/*
 	 * If we woke up early and the tick was reprogrammed in the
 	 * meantime then this may be spurious but harmless.


* Re: [PATCH 3/3] pmdomain: qcom: rpmhpd: Add power domains for Nord SoC
From: Dmitry Baryshkov @ 2026-04-14 19:27 UTC (permalink / raw)
  To: Shawn Guo
  Cc: Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Kamal Wadhwa, Taniya Das,
	Bartosz Golaszewski, Deepti Jaggi, linux-arm-msm, linux-pm,
	devicetree, linux-kernel
In-Reply-To: <20260414035909.652992-4-shengchao.guo@oss.qualcomm.com>

On Tue, Apr 14, 2026 at 11:59:09AM +0800, Shawn Guo wrote:
> From: Kamal Wadhwa <kamal.wadhwa@oss.qualcomm.com>
> 
> Add RPMh power domains required for Nord SoC.  This includes
> new definitions for power domains supplying GFX1 and NSP3 subsystem.
> 
> Co-developed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> Signed-off-by: Kamal Wadhwa <kamal.wadhwa@oss.qualcomm.com>
> Signed-off-by: Shawn Guo <shengchao.guo@oss.qualcomm.com>
> ---
>  drivers/pmdomain/qcom/rpmhpd.c | 35 ++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
> 

Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>


-- 
With best wishes
Dmitry


* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Hanabishi @ 2026-04-14 18:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Eric Naim, Thomas Gleixner, LKML, Calvin Owens, Peter Zijlstra,
	Anna-Maria Behnsen, Ingo Molnar, John Stultz, Stephen Boyd,
	Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
	Sebastian Reichel, linux-pm, Pablo Neira Ayuso, Florian Westphal,
	Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <ad6BtKRj1GyreNCS@localhost.localdomain>

On 14/04/2026 18:04, Frederic Weisbecker wrote:
> Can you try the following?
> 
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index b4d730604972..5c6dfd6bed28 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -100,6 +100,7 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
>   		/* The clockevent device is getting replaced. Shut it down. */
>   
>   	case CLOCK_EVT_STATE_SHUTDOWN:
> +		dev->next_event_forced = 0;
>   		if (dev->set_state_shutdown)
>   			return dev->set_state_shutdown(dev);
>   		return 0;
> @@ -127,10 +128,12 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
>   			      clockevent_get_state(dev)))
>   			return -EINVAL;
>   
> -		if (dev->set_state_oneshot_stopped)
> +		if (dev->set_state_oneshot_stopped) {
> +			dev->next_event_forced = 0;
>   			return dev->set_state_oneshot_stopped(dev);
> -		else
> +		} else {
>   			return -ENOSYS;
> +		}
>   
>   	default:
>   		return -ENOSYS;
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index 7e57fa31ee26..115e0bf01276 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -108,6 +108,7 @@ static struct clock_event_device *tick_get_oneshot_wakeup_device(int cpu)
>   
>   static void tick_oneshot_wakeup_handler(struct clock_event_device *wd)
>   {
> +	wd->next_event_forced = 0;
>   	/*
>   	 * If we woke up early and the tick was reprogrammed in the
>   	 * meantime then this may be spurious but harmless.

This patch doesn't help me unfortunately. Thanks.



* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Eric Naim @ 2026-04-14 18:19 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Hanabishi, Thomas Gleixner, LKML, Peter Zijlstra,
	Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar, John Stultz,
	Stephen Boyd, Alexander Viro, Christian Brauner, Jan Kara,
	linux-fsdevel, Sebastian Reichel, linux-pm, Pablo Neira Ayuso,
	Florian Westphal, Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <ad54kGakZkvCoRaT@mozart.vkv.me>

On 4/15/26 1:25 AM, Calvin Owens wrote:
> On Tuesday 04/14 at 15:39 +0000, Eric Naim wrote:
>> On 4/14/26 5:20 AM, Hanabishi wrote:
>>>
>>> Hello.
>>>
>>> Sorry, but this patch as of 7.0 introduced *severe* periodic lockups on my
>>> Ryzen 7700X machine.
>>> I see such messages in the log:
>>>
>>> clocksource: Long readout interval, skipping watchdog check: cs_nsec:
>>> 2897344852 wd_nsec: 2897356996
>>>
>>> Reverting d6e152d905bdb1f32f9d99775e2f453350399a6a for mainline fixes the
>>> issue for me.
>>>
>>
>> Hi maintainers,
>>
>> several users from CachyOS have reported this regression as well. We landed on
>> the same bisection. One of the users that could reproduce this reliably
>> reproduced this just by watching a YouTube video in a browser, and observed
>> freezes and stutters when interacting with the system.
> 
> Huh, I can't reproduce this at all across 10+ machines. Can you share
> the Kconfig you're seeing this on?

Right, here it is [1]. CachyOS does carry a lot of downstream patches, but I
made sure to reproduce this on mainline before reporting here.

[1]
https://github.com/CachyOS/linux-cachyos/blob/4224303b6d7a50dd1cc3ffa78864050cc9536eec/linux-cachyos/config

> 
> Thanks,
> Calvin

-- 
Regards,
  Eric


* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Frederic Weisbecker @ 2026-04-14 18:04 UTC (permalink / raw)
  To: Eric Naim
  Cc: Hanabishi, Thomas Gleixner, LKML, Calvin Owens, Peter Zijlstra,
	Anna-Maria Behnsen, Ingo Molnar, John Stultz, Stephen Boyd,
	Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
	Sebastian Reichel, linux-pm, Pablo Neira Ayuso, Florian Westphal,
	Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <aeb848aa-404a-40fb-bd41-329644623b1d@cachyos.org>

Le Tue, Apr 14, 2026 at 03:39:00PM +0000, Eric Naim a écrit :
> On 4/14/26 5:20 AM, Hanabishi wrote:
> > 
> > Hello.
> > 
> > Sorry, but this patch as of 7.0 introduced *severe* periodic lockups on my
> > Ryzen 7700X machine.
> > I see such messages in the log:
> > 
> > clocksource: Long readout interval, skipping watchdog check: cs_nsec:
> > 2897344852 wd_nsec: 2897356996
> > 
> > Reverting d6e152d905bdb1f32f9d99775e2f453350399a6a for mainline fixes the
> > issue for me.
> > 
> 
> Hi maintainers,
> 
> several users from CachyOS have reported this regression as well. We landed on
> the same bisection. One of the users that could reproduce this reliably
> reproduced this just by watching a YouTube video in a browser, and observed
> freezes and stutters when interacting with the system.
> 
> I had an LLM generate a fix (patch attached), and it fixed the regression for
> that user. Full disclosure: it is written completely by AI, and I am also not
> familiar with this subsystem. I just hope that this patch can be helpful in
> fixing the regression.
> 
> Please don't hesitate to tell me off if utilizing AI in this way is not
> helpful, so I can keep this in mind for future contributions.
> 
> 
> -- 
> Regards,
>   Eric

> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 38570998a19b..37b10045572e 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -332,8 +332,10 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
>  	if (delta > (int64_t)dev->min_delta_ns) {
>  		delta = min(delta, (int64_t) dev->max_delta_ns);
>  		clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
> -		if (!dev->set_next_event((unsigned long) clc, dev))
> +		if (!dev->set_next_event((unsigned long) clc, dev)) {
> +			dev->next_event_forced = 0;
>  			return 0;
> +		}
>  	}
>  
>  	if (dev->next_event_forced)
> diff --git a/kernel/time/tick-oneshot.c b/kernel/time/tick-oneshot.c
> index 7472597f3225..bf411472d4f7 100644
> --- a/kernel/time/tick-oneshot.c
> +++ b/kernel/time/tick-oneshot.c
> @@ -34,6 +34,7 @@ int tick_program_event(ktime_t expires, int force)
>  		 */
>  		clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED);
>  		dev->next_event = KTIME_MAX;
> +		dev->next_event_forced = 0;
>  		return 0;
>  	}
>  
> @@ -43,6 +44,7 @@ int tick_program_event(ktime_t expires, int force)
>  		 * before using it.
>  		 */
>  		clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
> +		dev->next_event_forced = 0;
>  	}
>  
>  	return clockevents_program_event(dev, expires, force);

That diff suggests that dev->next_event_forced is not properly cleared by
a handler or when the device is stopped.

For example it's not cleared when the device is oneshot stopped.

It's also not cleared when the device is detached (though that shouldn't
matter much) and also when the broadcast wakeup thing is used.

Can you try the following?

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index b4d730604972..5c6dfd6bed28 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -100,6 +100,7 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
 		/* The clockevent device is getting replaced. Shut it down. */
 
 	case CLOCK_EVT_STATE_SHUTDOWN:
+		dev->next_event_forced = 0;
 		if (dev->set_state_shutdown)
 			return dev->set_state_shutdown(dev);
 		return 0;
@@ -127,10 +128,12 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
 			      clockevent_get_state(dev)))
 			return -EINVAL;
 
-		if (dev->set_state_oneshot_stopped)
+		if (dev->set_state_oneshot_stopped) {
+			dev->next_event_forced = 0;
 			return dev->set_state_oneshot_stopped(dev);
-		else
+		} else {
 			return -ENOSYS;
+		}
 
 	default:
 		return -ENOSYS;
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 7e57fa31ee26..115e0bf01276 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -108,6 +108,7 @@ static struct clock_event_device *tick_get_oneshot_wakeup_device(int cpu)
 
 static void tick_oneshot_wakeup_handler(struct clock_event_device *wd)
 {
+	wd->next_event_forced = 0;
 	/*
 	 * If we woke up early and the tick was reprogrammed in the
 	 * meantime then this may be spurious but harmless.

^ permalink raw reply related

* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 17:32 UTC (permalink / raw)
  To: Kairui Song
  Cc: Liam.Howlett, akpm, apopple, axelrasmussen, baohua, baolin.wang,
	bhe, byungchul, cgroups, chengming.zhou, chrisl, corbet, david,
	dev.jain, gourry, hannes, hughd, jannh, joshua.hahnjy, lance.yang,
	lenb, linux-doc, linux-kernel, linux-mm, linux-pm,
	lorenzo.stoakes, matthew.brost, mhocko, muchun.song, npache,
	pavel, peterx, peterz, pfalcato, rafael, rakie.kim,
	roman.gushchin, rppt, ryan.roberts, shakeel.butt, shikemeng,
	surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed, yuanchu,
	zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAKEwX=NrUhUrAFx+8BYJEfaVKpCm-H9JhBzYSrqOQb-NW7QRug@mail.gmail.com>

On Tue, Apr 14, 2026 at 10:23 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> * I still think there's a good chance we can *significantly* close the
> gap overall between a design with virtual swap and a design without.
> It's a bit premature to commit to a vswap-optional route (which to be
> completely honest I'm still not confident is possible to satisfy all
> of our requirements).

And to further note - these benchmarks measure, in effect, purely swap
overhead. In a production environment with a lot of non-swap work, as
long as the gap is close enough I think we would be fine, even for a
hostile case like a fast swapfile-backend (I assume SSD swap's
bottleneck will be the IO mostly).

I will stare at your responses to see if there are other benchmarks I
can play with, but it would be very helpful if you can share your full
suite :)

^ permalink raw reply

* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Calvin Owens @ 2026-04-14 17:25 UTC (permalink / raw)
  To: Eric Naim
  Cc: Hanabishi, Thomas Gleixner, LKML, Peter Zijlstra,
	Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar, John Stultz,
	Stephen Boyd, Alexander Viro, Christian Brauner, Jan Kara,
	linux-fsdevel, Sebastian Reichel, linux-pm, Pablo Neira Ayuso,
	Florian Westphal, Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <aeb848aa-404a-40fb-bd41-329644623b1d@cachyos.org>

On Tuesday 04/14 at 15:39 +0000, Eric Naim wrote:
> On 4/14/26 5:20 AM, Hanabishi wrote:
> > 
> > Hello.
> > 
> > Sorry, but this patch as of 7.0 introduced *severe* periodic lockups on my
> > Ryzen 7700X machine.
> > I see such messages in the log:
> > 
> > clocksource: Long readout interval, skipping watchdog check: cs_nsec:
> > 2897344852 wd_nsec: 2897356996
> > 
> > Reverting d6e152d905bdb1f32f9d99775e2f453350399a6a for mainline fixes the
> > issue for me.
> > 
> 
> Hi maintainers,
> 
> several users from CachyOS has reported this regression as well. We landed on
> the same bisection. One of the users that could reproduce this reliably
> reproduced this just by watching a YouTube video in a browser, and observed
> freezes and stutters when interacting with the system.

Huh, I can't reproduce this at all across 10+ machines. Can you share
the Kconfig you're seeing this on?

Thanks,
Calvin

^ permalink raw reply

* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 17:23 UTC (permalink / raw)
  To: Kairui Song
  Cc: Liam.Howlett, akpm, apopple, axelrasmussen, baohua, baolin.wang,
	bhe, byungchul, cgroups, chengming.zhou, chrisl, corbet, david,
	dev.jain, gourry, hannes, hughd, jannh, joshua.hahnjy, lance.yang,
	lenb, linux-doc, linux-kernel, linux-mm, linux-pm,
	lorenzo.stoakes, matthew.brost, mhocko, muchun.song, npache,
	pavel, peterx, peterz, pfalcato, rafael, rakie.kim,
	roman.gushchin, rppt, ryan.roberts, shakeel.butt, shikemeng,
	surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed, yuanchu,
	zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAKEwX=P4syV38jAVCWq198r2OHXXc=xA-fx1dk6+qYef6yzxWQ@mail.gmail.com>

On Mon, Mar 23, 2026 at 1:05 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Mar 23, 2026 at 12:41 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2026 at 11:33 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Mon, Mar 23, 2026 at 6:09 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > > > This patch series is based on 6.19. There are a couple more
> > > > > swap-related changes in mainline that I would need to coordinate
> > > > > with, but I still want to send this out as an update for the
> > > > > regressions reported by Kairui Song in [15]. It's probably easier
> > > > > to just build this thing rather than dig through that series of
> > > > > emails to get the fix patch :)
> > > > >
> > > > > Changelog:
> > > > > * v4 -> v5:
> > > > >     * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
> > > > >     * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
> > > > >       and use guard(rcu) in vswap_cpu_dead
> > > > >       (reported by Peter Zijlstra [17]).
> > > > > * v3 -> v4:
> > > > >     * Fix poor swap free batching behavior to alleviate a regression
> > > > >       (reported by Kairui Song).
> > > >
> > >
> > > Hi Kairui! Thanks a lot for the testing big boss :) I will focus on
> > > the regression in this patch series - we can talk more about
> > > directions in another thread :)

Hi Kairui,

My apologies if I missed your response, but could you share with me
your full benchmark suite? It would be hugely useful, not just for
this series, but for all swap contributions in the future :) We should
do as much homework ourselves as possible :P

And apologies for the delayed response. I kept having to go back and
forth between investigating the regression and figuring out what was
going on with the build setups (I missed some of the CONFIGs you had
originally), reducing variance on hosts, etc.

I don't have PMEM, so I have only worked with zram backend so far. I
did manage to reproduce the regressions you showed me (albeit at a
much smaller gap on certain metrics than your cited numbers, which I
suspect is due to zram/pmem difference).

There are two benchmarks that I focused on:

1. Usemem - the exact command I ran is: time ./usemem --init-time -O
-y -x -n 1 56G

My host is 32GB, 52 processor(s) / x86_64.

Build       real (s)        vs base  sys (s)         tput (KB/s)         free_ms
baseline    175.6 +/- 3.6   -        121.9 +/- 3.3   391,941 +/- 8,333   6,992 +/- 204
vss_v5      184.0 +/- 3.9   +4.8%    130.5 +/- 3.8   376,192 +/- 8,581   8,297 +/- 247

(I hope the formatting works, but let me know if it looks weird).

2. Memhog: time memhog 48G

My host for this one is 16 GB, 52 processors, x86_64 too.

Build        real (s)          vs base   sys (s)
baseline      80.5 +/- 1.9      —         62.7 +/- 2.0
vss_v5        83.0 +/- 1.8    +3.1%       65.7 +/- 1.8

On both benchmarks, I enabled MGLRU to more closely match the setup you had.

Staring at the run logs (and double-checking against the logs you sent
me to make sure it's not just my system), there are some common patterns
I noticed across these runs:

1. Kswapd is slower on the vswap side, which shifts work towards
direct reclaim, and makes compaction have to run harder (which has a
weird contention through zsmalloc - I can expand further, but this is
not vswap-specific, just exacerbated by slower kswapd).

2. Higher swap readahead (albeit with higher hit rate) - this is more
of an artifact of the fact that zero swap pages are no longer backed
by zram swapfile, which skipped readahead in certain paths. We can
ignore this for now, but worth assessing this for fast swap backends
in general (zero swap pages, zswap, so on and so forth).

I spent some time perf-ing kswapd, and hacked the usemem binary a bit so
that I can perf the free stage of usemem separately. Most of the
vswap-specific overhead lies in the xarray lookups. Some big offenders
off the top of my head:

1. Right now, in the physical swap allocator, whenever we have an
allocated slot in the range we're checking, we check if that slot is
swap-cache-only (i.e no swap count), and if so we try to free it (if
swapfile is almost full etc.). This check is cheap if all swap entry
metadata live in physical swap layer only, but more expensive when you
have to go through another layer of indirection :)

I fixed that by just taking one bit in the reverse map to track
swap-cache-only state, which eliminates this without extra space
overhead (on top of the existing design).

2. On the free path, in swap_pte_batch(), we check cgroup to make sure
that the range we pass to free_swap_and_cache_nr() belongs to the same
cgroup, which has a per-PTE overhead for going to the vswap layer. We
can make this check once per range instead, to reduce overhead. Even
better - we can skip this check in swap_pte_batch() for the free case,
and defer it to later on, where we already enter the vswap cluster lock
context :)

With a bunch of changes like that, I closed the gap significantly:

usemem:
Build       real (s)        vs base  sys (s)         tput (KB/s)         free_ms
baseline    175.6 +/- 3.6   -        121.9 +/- 3.3   391,941 +/- 8,333   6,992 +/- 204
new_opt_v2  179.8 +/- 3.0   +2.4%    126.1 +/- 2.9   382,536 +/- 6,662   7,105 +/- 183

memhog:
Build        real (s)          vs base   sys (s)
baseline      80.5 +/- 1.9      —         62.7 +/- 2.0
new_opt_v2    79.9 +/- 1.7    -0.8%       62.4 +/- 1.7

I would also like to point out that some of this overhead is specific
to the swapfile backend case, which is why we don't see this in zswap
in the stats I included in V5. Zswap does not require this
swap-cache-only dance, because in virtual swap, zswap only needs the
virtual swap slot as the index (on top of much more negligible space
overhead thanks to zswap tree merging into vswap cluster, no swap
charging, no double allocation, etc.).

Anyway, still a small gap. The next idea that I have is inspired by
the TLB, which caches virtual->physical memory address translations. I
added a per-CPU MRU virtual cluster. The idea is that a lot of
consecutive swap operations operate on the same range of swap entries -
merging these operations of course makes the most sense, but sometimes
it's not convenient to do it. The non-vswap, old design sometimes locks
the physical swap cluster and exposes the swap cluster struct to callers
to pass around, but I would like to avoid that if possible :)

With this change, we close the gap even further - exceeding the
baseline on average in certain cases, but as you can see it's within
noise so I wouldn't conclude too much from it:

usemem:
Build       real (s)        vs base  sys (s)         tput (KB/s)         free_ms
baseline    175.6 +/- 3.6   -        121.9 +/- 3.3   391,941 +/- 8,333   6,992 +/- 204
cc_v2       176.4 +/- 5.3   +0.4%    123.6 +/- 5.4   390,405 +/- 12,792  6,987 +/- 296


memhog:
Build        real (s)          vs base   sys (s)
baseline      80.5 +/- 1.9      —         62.7 +/- 2.0
cc_v2         79.9 +/- 0.9    -0.8%       62.1 +/- 1.5

The reclaim and compaction stats tell a similar story:

Reclaim / Compaction (usemem)
Metric          baseline                vss_v5                  new_opt_v2              cc_v2
allocstall      167,787 +/- 10,292      170,532 +/- 15,185      169,782 +/- 9,903       168,635 +/- 13,526
pgsteal_kswapd  6,932,143 +/- 186,411   6,965,962 +/- 288,323   6,968,188 +/- 286,383   7,038,513 +/- 202,696
pgsteal_direct  9,759,350 +/- 480,674   9,978,721 +/- 765,543   9,899,698 +/- 480,781   9,845,668 +/- 544,319
swap_ra         82.9 +/- 22.6           5994.8 +/- 2817.5       4976.8 +/- 1484.2       4718.2 +/- 1510.5
pgmigrate       1,029,901 +/- 428,416   1,687,072 +/- 399,505   1,260,451 +/- 202,603   1,144,560 +/- 490,177

Reclaim / Compaction (memhog)
Metric          baseline                vss_v5                  new_opt_v2              cc_v2
allocstall      101,245 +/- 6,271       109,320 +/- 12,180      100,207 +/- 11,053      99,223 +/- 9,905
pgsteal_kswapd  8,817,264 +/- 432,519   8,436,548 +/- 265,763   8,728,944 +/- 305,101   8,962,443 +/- 589,012
pgsteal_direct  5,408,046 +/- 394,775   5,932,611 +/- 584,873   5,419,891 +/- 551,226   5,349,352 +/- 601,655
swap_ra         66.5 +/- 22.8           8589.5 +/- 3325.1       8954.5 +/- 2661.9       8703.1 +/- 1746.6
pgmigrate       239,410 +/- 46,014      277,193 +/- 71,487      320,672 +/- 59,488      243,989 +/- 136,129

You can see that the latter versions gradually restore the behaviors
of baseline in terms of reclaim dynamics :)

Some final remarks:
* I still think there's a good chance we can *significantly* close the
gap overall between a design with virtual swap and a design without.
It's a bit premature to commit to a vswap-optional route (which to be
completely honest I'm still not confident is possible to satisfy all
of our requirements).

* Regardless of the direction we take, these are all pitfalls that
will be problematic for virtual swap design, and more generally some
of them will affect any dynamic swap design (which has to go through
some sort of indirection or a dynamic data structure like xarray that
will induce some amount of lookup overhead). I hope my work here can
be useful in this sense too, outside of this specific vswap direction
:)

I will clean things up a bit and send you a v6 for further inspection.
Once again, I'd like to express my gratitude for your engagement and
feedback.

^ permalink raw reply

* [PATCH v2] PM: hibernate: keep existing uswsusp swap pin if re-selection fails
From: DaeMyung Kang @ 2026-04-14 16:49 UTC (permalink / raw)
  To: Andrew Morton, Rafael J . Wysocki
  Cc: Youngjun Park, Kairui Song, Chris Li, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Len Brown, Pavel Machek, linux-mm,
	linux-pm, linux-kernel, DaeMyung Kang
In-Reply-To: <20260414143200.1267932-1-charsyam@gmail.com>

Commit 5b2b0c6e4577 ("mm/swap, PM: hibernate: fix swapoff race in
uswsusp by pinning swap device") introduced SWP_HIBERNATION so that
the swap area selected through /dev/snapshot remains protected against
swapoff() for the lifetime of the uswsusp session.

When user space issues SNAPSHOT_SET_SWAP_AREA again,
snapshot_set_swap_area() currently drops the old pin before attempting
to pin the new swap area.  If the new selection fails, the ioctl
returns an error and user space is expected to abort the session.
However, preserving the existing pin in that case makes the kernel
side more robust against a failed re-selection, while keeping the
existing userspace-visible behavior unchanged.

Implement this with the existing swap helpers:

  - look up the requested swap area first
  - treat re-selecting the already pinned area as a no-op
  - pin the new area before unpinning the old one
  - leave the existing pin in place if the new pin attempt fails

This keeps the hibernation session protected against swapoff() until
/dev/snapshot is closed, even after a failed attempt to switch to a
different swap area.

Suggested-by: Youngjun Park <youngjun.park@lge.com>
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
---
Notes (not part of the commit, stripped by git am):

Changes in v2:
  - Drop Fixes: and Cc: stable; reframe as a hardening improvement
    rather than a regression fix, per Youngjun's feedback that the
    current behavior is intentional and there is no concrete
    user-observable harm.
  - Drop the new repin_hibernation_swap_type() helper. Rework
    snapshot_set_swap_area() in place using the existing find / pin /
    unpin helpers as Youngjun suggested; the change now touches only
    kernel/power/user.c and adds no new API.
  - Update the subject and commit log accordingly.
  - Add Suggested-by: trailer.

v1: https://lore.kernel.org/lkml/20260414143200.1267932-1-charsyam@gmail.com/

Baseline
--------
This patch is generated against linux-next at commit 5b2b0c6e4577
("mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap
device"). Mainline does not yet carry that commit, and neither the
helpers used here (find/pin/unpin_hibernation_swap_type) nor the code
site this patch modifies exist there. The base-commit trailer at the
bottom of the mbox records the exact commit.

Testing
-------
The behavior change can be exercised entirely through the
/dev/snapshot ioctl path; no actual hibernation cycle is required.
A targeted assertion test is below; run it as root in a throwaway VM
with two active swap block devices and one non-swap block device
(three arguments).

Run inside a VM on linux-next at 5b2b0c6e4577 with this patch applied:

  step1: pinned active swap /dev/vda
  step2: swapoff blocked with EBUSY while pin is held
  step3: repinned active swap to /dev/vdb
  step4: swapoff(/dev/vda) succeeded after repinning away
  step5: repinned swap is blocked with EBUSY
  step6: bogus SNAPSHOT_SET_SWAP_AREA failed as expected: No such device
  step7: swapoff(/dev/vdb) is still blocked with EBUSY
  result: pin preserved across failed re-set (hardened behavior)
  step8: swapoff succeeded after closing /dev/snapshot

Without the patch, step7 instead reports
  swapoff(/dev/vdb) succeeded after failed re-set
because the old pin had been released before the failed pin attempt.

What the assertion test covers:
  - SWP_HIBERNATION is enforced against swapoff (step2, step5);
  - the success path moves the pin from one active swap to another
    (step3, step4, step5);
  - a failed re-selection preserves the existing pin (step6, step7);
  - the pin lifetime ends on /dev/snapshot close (step8).

What it does not cover:
  - the snapshot_open(O_RDONLY) initial resume-device pin path;
  - the full suspend-to-disk image create/restore flow;
  - concurrent swapoff racing against SNAPSHOT_SET_SWAP_AREA;
  - the type == data->swap idempotent branch (not externally
    observable since it intentionally skips the bit toggle).

A normal sysfs-based suspend-to-disk cycle continues to work; the
find_hibernation_swap_type() / pin / unpin paths themselves are
unchanged. Build tested with allmodconfig and run-tested with
CONFIG_PROVE_LOCKING=y and CONFIG_KASAN=y. The VM was booted with
oops=panic panic=-1 so any WARN/Oops/BUG would have halted the run;
the full test completed cleanly with no kernel log diagnostics.

Reproducer (C source, for reference only -- not added to the tree):

 // SPDX-License-Identifier: GPL-2.0
 /*
  * Reproduce / verify the SNAPSHOT_SET_SWAP_AREA pin-lifetime behavior.
  *
  * Run only inside a throwaway VM. The test manipulates swap state and
  * leaves the target swap area disabled on success.
  *
  * Usage:
  *   ./uswsusp_swapoff_repro <active-swap-1> <active-swap-2> <bogus-blk>
  *
  * Exit codes:
  *   0 = expected (hardened) behavior: pin preserved across failed re-set
  *   1 = old behavior: pin dropped on failed re-set
  *   2 = setup error / inconclusive
  */

 #define _GNU_SOURCE
 #include <errno.h>
 #include <fcntl.h>
 #include <linux/types.h>
 #include <linux/suspend_ioctls.h>
 #include <stdbool.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/ioctl.h>
 #include <sys/stat.h>
 #include <sys/swap.h>
 #include <sys/sysmacros.h>
 #include <unistd.h>

 static int encode_dev(dev_t dev)
 {
 	unsigned int major_num = major(dev);
 	unsigned int minor_num = minor(dev);

 	/* Match new_encode_dev() / new_decode_dev() in the kernel. */
 	return (major_num & 0xfff) << 8 |
 	       (minor_num & 0xff) |
 	       ((minor_num & ~0xff) << 12);
 }

 static int get_block_dev(const char *path, dev_t *dev)
 {
 	struct stat st;

 	if (stat(path, &st) < 0) {
 		int err = errno;	/* save: fprintf() may clobber errno */

 		fprintf(stderr, "stat(%s): %s\n", path, strerror(err));
 		return -err;
 	}
 	if (!S_ISBLK(st.st_mode)) {
 		fprintf(stderr, "%s is not a block device\n", path);
 		return -EINVAL;
 	}
 	*dev = st.st_rdev;
 	return 0;
 }

 static int snapshot_set_swap_area(int fd, dev_t dev, long long offset)
 {
 	struct resume_swap_area area = {
 		.offset = offset,
 		.dev = encode_dev(dev),
 	};

 	if (ioctl(fd, SNAPSHOT_SET_SWAP_AREA, &area) < 0)
 		return -errno;
 	return 0;
 }

 int main(int argc, char **argv)
 {
 	const char *p1, *p2, *pb;
 	dev_t d1, d2, db;
 	int fd, ret;
 	bool buggy = false;

 	if (argc != 4) {
 		fprintf(stderr,
 			"usage: %s <swap1> <swap2> <bogus>\n", argv[0]);
 		return 2;
 	}
 	if (geteuid() != 0) {
 		fprintf(stderr, "must run as root\n");
 		return 2;
 	}
 	p1 = argv[1]; p2 = argv[2]; pb = argv[3];

 	if (get_block_dev(p1, &d1) < 0 ||
 	    get_block_dev(p2, &d2) < 0 ||
 	    get_block_dev(pb, &db) < 0)
 		return 2;

 	fd = open("/dev/snapshot", O_WRONLY);
 	if (fd < 0) {
 		fprintf(stderr, "open(/dev/snapshot): %s\n", strerror(errno));
 		return 2;
 	}

 	ret = snapshot_set_swap_area(fd, d1, 0);
 	if (ret < 0) { fprintf(stderr, "step1: %s\n", strerror(-ret)); goto setup_err; }
 	printf("step1: pinned active swap %s\n", p1);

 	if (swapoff(p1) == 0) {
 		fprintf(stderr, "step2: swapoff unexpectedly succeeded\n");
 		close(fd); return 1;
 	}
 	if (errno != EBUSY) {
 		fprintf(stderr, "step2: expected EBUSY, got %s\n", strerror(errno));
 		goto setup_err;
 	}
 	printf("step2: swapoff blocked with EBUSY while pin is held\n");

 	ret = snapshot_set_swap_area(fd, d2, 0);
 	if (ret < 0) { fprintf(stderr, "step3: %s\n", strerror(-ret)); goto setup_err; }
 	printf("step3: repinned active swap to %s\n", p2);

 	if (swapoff(p1) < 0) {
 		fprintf(stderr, "step4: swapoff(%s): %s\n", p1, strerror(errno));
 		goto setup_err;
 	}
 	printf("step4: swapoff(%s) succeeded after repinning away\n", p1);

 	if (swapoff(p2) == 0) {
 		fprintf(stderr, "step5: swapoff unexpectedly succeeded\n");
 		close(fd); return 1;
 	}
 	if (errno != EBUSY) {
 		fprintf(stderr, "step5: expected EBUSY, got %s\n", strerror(errno));
 		goto setup_err;
 	}
 	printf("step5: repinned swap is blocked with EBUSY\n");

 	ret = snapshot_set_swap_area(fd, db, 0);
 	if (!ret) {
 		fprintf(stderr, "step6: bogus unexpectedly succeeded\n");
 		goto setup_err;
 	}
 	printf("step6: bogus SNAPSHOT_SET_SWAP_AREA failed as expected: %s\n",
 	       strerror(-ret));

 	if (swapoff(p2) == 0) {
 		printf("step7: swapoff(%s) succeeded after failed re-set\n", p2);
 		printf("result: pin was dropped on failure (old behavior)\n");
 		buggy = true;
 	} else if (errno == EBUSY) {
 		printf("step7: swapoff(%s) is still blocked with EBUSY\n", p2);
 		printf("result: pin preserved across failed re-set (hardened behavior)\n");
 	} else {
 		fprintf(stderr, "step7: unexpected: %s\n", strerror(errno));
 		goto setup_err;
 	}

 	close(fd);
 	if (!buggy) {
 		if (swapoff(p2) < 0) {
 			fprintf(stderr, "step8: swapoff(%s): %s\n", p2, strerror(errno));
 			return 2;
 		}
 		printf("step8: swapoff succeeded after closing /dev/snapshot\n");
 	}
 	printf("note: re-enable with `swapon %s` and `swapon %s`\n", p1, p2);
 	return buggy ? 1 : 0;

 setup_err:
 	close(fd);
 	return 2;
 }

 kernel/power/user.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/power/user.c b/kernel/power/user.c
index 4406f5644a56..e1ab85db2e95 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -218,6 +218,7 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
 {
 	sector_t offset;
 	dev_t swdev;
+	int type, swap;
 
 	if (swsusp_swap_in_use())
 		return -EPERM;
@@ -239,18 +240,34 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
 	}
 
 	/*
-	 * Unpin the swap device if a swap area was already
-	 * set by SNAPSHOT_SET_SWAP_AREA.
+	 * User space encodes device types as two-byte values, so we need to
+	 * recode them.
 	 */
-	unpin_hibernation_swap_type(data->swap);
+	type = find_hibernation_swap_type(swdev, offset);
+	if (type < 0)
+		return swdev ? -ENODEV : -EINVAL;
 
-	/*
-	 * User space encodes device types as two-byte values,
-	 * so we need to recode them
-	 */
-	data->swap = pin_hibernation_swap_type(swdev, offset);
-	if (data->swap < 0)
+	if (type == data->swap) {
+		/*
+		 * Re-selecting the already pinned swap area is a no-op.
+		 * Keep the existing pin and just refresh the cached device id.
+		 */
+		data->dev = swdev;
+		return 0;
+	}
+
+	swap = pin_hibernation_swap_type(swdev, offset);
+	if (swap < 0) {
+		/*
+		 * Preserve the existing pin on failure.  This can happen if the
+		 * target swap area disappears before pinning, or via the
+		 * defensive -EBUSY path in pin_hibernation_swap_type().
+		 */
 		return swdev ? -ENODEV : -EINVAL;
+	}
+
+	unpin_hibernation_swap_type(data->swap);
+	data->swap = swap;
 	data->dev = swdev;
 	return 0;
 }

base-commit: 5b2b0c6e457765adbe96fb2d464ff1bcd3d72158
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 16:35 UTC (permalink / raw)
  To: Kairui Song
  Cc: YoungJun Park, Liam.Howlett, akpm, apopple, axelrasmussen, baohua,
	baolin.wang, bhe, byungchul, cgroups, chengming.zhou, chrisl,
	corbet, david, dev.jain, gourry, hannes, hughd, jannh,
	joshua.hahnjy, lance.yang, lenb, linux-doc, linux-kernel,
	linux-mm, linux-pm, lorenzo.stoakes, matthew.brost, mhocko,
	muchun.song, npache, pavel, peterx, peterz, pfalcato, rafael,
	rakie.kim, roman.gushchin, rppt, ryan.roberts, shakeel.butt,
	shikemeng, surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed,
	yuanchu, zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAMgjq7BO6SLZPfNXDh1F-7RAOqDAfqMQ4PM=qjAq1mCsWyD0LQ@mail.gmail.com>

On Mon, Apr 13, 2026 at 8:29 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Apr 14, 2026 at 11:05 AM YoungJun Park <youngjun.park@lge.com> wrote:
> >
>
> Hi All,
>
> > On Sat, Apr 11, 2026 at 06:40:44PM -0700, Nhat Pham wrote:
> > > > 1. Modularization
> > > >
> > > > You removed CONFIG_* and went with a unified approach. I recall
> > > > you were also considering a module-based structure at some point.
> > > > What are your thoughts on that direction?
> > > >
> > >
> > > The CONFIG-based approach was a huge mess. It makes me not want to
> > > look at the code, and I'm the author :)
> > >
> > > > If we take that approach, we could extend the recent swap ops
> > > > patchset (https://lore.kernel.org/linux-mm/20260302104016.163542-1-bhe@redhat.com/)
> > > > as follows:
> > > > - Make vswap a swap module
> > > > - Have cluster allocation functions reside in swapops
> > > > - Enable vswap through swapon
> > >
> > > Hmmmmm.
> >
> > I think this would be a happy world, but I wonder what others think.
> > Anyway, I'm looking forward to the future direction.
> >
>
> Yeah, I agree with this.
>
> And I do think swapoff of the virtual space itself is also necessary,
> we really need a failsafe, e.g. a clean way to drop the swap
> cache and data, kind of like drop_caches or shrinker fs are
> commonly used.
>
> > > > 2. Flash-friendly swap integration (for my use case)
> > > >
> > > > I've been thinking about the flash-friendly swap concept that
> > > > I mentioned before and recently proposed:
> > > > (https://lore.kernel.org/linux-mm/aZW0voL4MmnMQlaR@yjaykim-PowerEdge-T330/)
> > > >
> > > > One of its core functions requires buffering RAM-swapped pages
> > > > and writing them sequentially at an appropriate time -- not
> > > > immediately, but in proper block-sized units, sequentially.
> > > >
> > > > This means allocated offsets must essentially be virtual, and
> > > > physical offsets need to be managed separately at the actual
> > > > write time.
> > > >
> > > > If we integrate this into the current vswap, we would either
> > > > need vswap itself to handle the sequential writes (bypassing
> > > > the physical device and receiving pages directly), or swapon
> > > > a swap device and have vswap obtain physical offsets from it.
> > > > But since those offsets cannot be used directly (due to
> > > > buffering and sequential write requirements), they become
> > > > virtual too, resulting in:
> > > >
> > > >   virtual -> virtual -> physical
> > > >
> > > > This triple indirection is not ideal.
> > > >
> > > > However, if the modularization from point 1 is achieved and
> > > > vswap acts as a swap device itself, then we can cleanly
> > > > establish a:
> > > >
> > > >   virtual -> physical
> > >
> > > I read that thread sometimes ago. Some remarks:
> > >
> > > 1. I think Christoph has a point. Seems like some of your ideas ( are
> > > broadly applicable to swap in general. Maybe fixing swap infra
> > > generally would make a lot of sense?
> >
> > Broadly speaking, there are two main ideas:
> > 1. Swap I/O buffering (which is also tied to cluster management issues)
> > 2. Deduplication
> >
> > Are you leaning towards the view that these two should be placed in a
> > higher layer?
>
> IMHO the swap infra should be doing less, not more, so we can have
> more flexible design, and different backends can implement their own
> way to manage the data and layer. e.g. Having one backend being
> flash friendly and it can do this without caring or affecting other devices
> or backends.

I think that's what Youngjun already has, unless I misunderstand his
descriptions.

>
> > If it goes into ZSWAP, there would definitely be a clear advantage of
> > seeing dedup benefits across all swap devices. It's a technically
> > interesting area, and I'd like to discuss it in a separate thread if
> > I have more ideas or thoughts.
>
> Just brainstorming... Why don't we just merge these identical pages like
> KSM? Maybe at least zero folios might benefit a lot if we keep them
> mapped as RO instead of recording them in swap, seems better in the
> long term?

That's our preferred approach too. We just didn't manage to get that
to work (yet). :)


* Re: [PATCH] PM: hibernate: preserve uswsusp swap pin across SNAPSHOT_SET_SWAP_AREA re-set failures
From: YoungJun Park @ 2026-04-14 16:18 UTC (permalink / raw)
  To: DaeMyung Kang
  Cc: Andrew Morton, Rafael J . Wysocki, Kairui Song, Chris Li,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Len Brown,
	Pavel Machek, linux-mm, linux-pm, linux-kernel, stable
In-Reply-To: <20260414143200.1267932-1-charsyam@gmail.com>

On Tue, Apr 14, 2026 at 11:32:00PM +0900, DaeMyung Kang wrote:

Hi Daemyung :)

> Commit 5b2b0c6e4577 ("mm/swap, PM: hibernate: fix swapoff
> race in uswsusp by pinning swap device") introduced
> SWP_HIBERNATION so that the swap device chosen via
> /dev/snapshot is held against swapoff for the entire uswsusp
> session. The intended invariant is: from the first successful
> SNAPSHOT_SET_SWAP_AREA until the /dev/snapshot fd is closed,
> exactly one swap device is pinned.
>
> snapshot_set_swap_area() breaks that invariant on the re-set
> path:
>
>       unpin_hibernation_swap_type(data->swap);
>       data->swap = pin_hibernation_swap_type(swdev, offset);
>       if (data->swap < 0)
>               return swdev ? -ENODEV : -EINVAL;
>
> The unpin happens unconditionally before the new pin is
> attempted. If the new pin fails (e.g. user space supplies an
> offset/device that is not an active swap area), the session
> continues with no swap device pinned, reopening exactly the
> swapoff race the original commit was meant to close. A
> subsequent swapoff on the previously selected device now
> succeeds where it would have been blocked with EBUSY.

Hmm.. This was actually intentional.  The API semantics
are that a second SNAPSHOT_SET_SWAP_AREA abandons the
previous pin.  If the new pin fails, the ioctl returns
an error and userspace is responsible for aborting the
session -- proceeding without a pinned device makes no
sense.

The only benefit of preserving the old pin on failure
would be protecting against userspace that ignores the
error.  But even in that case, the session has no valid
swap target, so the hibernation image write itself
would fail before swapoff becomes a concern.  I think
this is an edge case rather than a bug.

IOW, looking at your test case, I think this part is
userspace's responsibility:

>       ret = snapshot_set_swap_area(fd, bogus_dev, 0);
>       if (!ret) {
>               fprintf(stderr,
>                       "step6: bogus SNAPSHOT_SET_SWAP_AREA unexpectedly succeeded\n");
>               close(fd);
>               return 2;
>       }

The ioctl has already returned an error here.  Userspace
should close /dev/snapshot and stop, not continue and
expect the old pin to still be in place.

(BTW, the tests are well written and easy to follow.
Thanks!)

For this patch to have real value, there should be
something that concretely breaks after the swapoff
succeeds.  But since the session has no valid swap
target at that point, is there any actual broken
behavior that follows?

>       if (!buggy) {
>               if (swapoff(swap_path_2) < 0) {
>                       fprintf(stderr,
>                               "step8: swapoff(%s) after close failed: %s\n",
>                               swap_path_2, strerror(errno));
>                       return 2;
>               }
>               printf("step8: swapoff succeeded after closing /dev/snapshot\n");
>       }

If you still see concrete value, I would be happy to
take this as an improvement (without Fixes: and
Cc: stable) -- see my suggestion below for a lighter
approach.

> Reordering pin/unpin in the caller cannot fix this
> cleanly. Each of pin_hibernation_swap_type() /
> unpin_hibernation_swap_type() acquires swap_lock
> independently, so any two-call sequence leaves a window
> in which swapoff can observe an inconsistent pin state.
> The same-area re-set case (type == old_type) also cannot
> be expressed with pin+unpin without either toggling the
> bit (racy) or returning EBUSY (a false error).
>
> Introduce repin_hibernation_swap_type(), which performs
> the transition atomically under a single swap_lock
> acquisition:

I understand the intent.  If you still see enough value
in preserving the pin on failure, I would suggest a
lighter approach -- see below.

> -     unpin_hibernation_swap_type(data->swap);
> -
> -     data->swap = pin_hibernation_swap_type(swdev, offset);
> -     if (data->swap < 0)
> +     swap = repin_hibernation_swap_type(data->swap, swdev,
> +                                        offset);
> +     if (swap < 0)
>               return swdev ? -ENODEV : -EINVAL;
> +     data->swap = swap;

Would it be simpler to use find_hibernation_swap_type()
to look up the new type first, and if it differs from
data->swap, call pin_hibernation_swap_type() on the new
one?  If pin succeeds, unpin the old one.  If it returns
-EBUSY, just keep the existing pin.

If swapoff sneaks in between find and pin, pin will
simply fail -- I don't think the kernel needs to
guarantee atomicity for that window.  This does acquire
swap_lock multiple times, but SNAPSHOT_SET_SWAP_AREA is
an extremely rare operation, so the extra lock
round-trips should be negligible.

Reusing the existing helpers seems preferable to adding
a new function with this much code for a single call
site.
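
A rough userspace model of that ordering, purely for illustration. The
model_* names are hypothetical stand-ins for the existing helpers, the
-EBUSY return for an already pinned area is an assumption, and the
locking is left out (in the kernel each helper would take swap_lock on
its own). It is a sketch of the idea, not the kernel code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAX_SWAP 4

/* Hypothetical stand-in for one swap area; not swap_info_struct. */
struct model_swap {
	bool active;	/* area is swapon'ed */
	bool pinned;	/* models SWP_HIBERNATION */
	int dev;	/* models the device number */
};

static struct model_swap model_swaps[MODEL_MAX_SWAP];

/* Models find_hibernation_swap_type(): lookup only, no state change. */
static int model_find(int dev)
{
	for (int i = 0; i < MODEL_MAX_SWAP; i++)
		if (model_swaps[i].active && model_swaps[i].dev == dev)
			return i;
	return -ENODEV;
}

/* Models pin_hibernation_swap_type(): lookup, then set the pin. */
static int model_pin(int dev)
{
	int type = model_find(dev);

	if (type < 0)
		return type;
	if (model_swaps[type].pinned)
		return -EBUSY;	/* assumed behaviour for a pinned area */
	model_swaps[type].pinned = true;
	return type;
}

/* Models unpin_hibernation_swap_type(): no-op for an invalid type. */
static void model_unpin(int type)
{
	if (type >= 0 && type < MODEL_MAX_SWAP)
		model_swaps[type].pinned = false;
}

/* Models swapoff(): refused while the area is pinned. */
static int model_swapoff(int type)
{
	assert(type >= 0 && type < MODEL_MAX_SWAP);
	if (model_swaps[type].pinned)
		return -EBUSY;
	model_swaps[type].active = false;
	return 0;
}

/*
 * Re-set path: pin the new area first and drop the old pin only on
 * success, so a failed re-set leaves *cur and its pin untouched.
 */
static int model_set_swap_area(int *cur, int dev)
{
	int type = model_find(dev);

	if (type < 0)
		return type;
	if (type == *cur)
		return 0;	/* same area: keep the existing pin */

	type = model_pin(dev);
	if (type < 0)
		return type;	/* area vanished or busy: old pin stays */

	model_unpin(*cur);
	*cur = type;
	return 0;
}
```

With two active areas, a failed call for an unknown device leaves the
second area pinned, so model_swapoff() on it still returns -EBUSY.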

What do you think?

Thanks again!
Youngjun Park


* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Eric Naim @ 2026-04-14 15:39 UTC (permalink / raw)
  To: Hanabishi, Thomas Gleixner, LKML
  Cc: Calvin Owens, Peter Zijlstra, Anna-Maria Behnsen,
	Frederic Weisbecker, Ingo Molnar, John Stultz, Stephen Boyd,
	Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
	Sebastian Reichel, linux-pm, Pablo Neira Ayuso, Florian Westphal,
	Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <68d1e9ac-2780-4be3-8ee3-0788062dd3a4@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1103 bytes --]

On 4/14/26 5:20 AM, Hanabishi wrote:
> 
> Hello.
> 
> Sorry, but this patch as of 7.0 introduced *severe* periodic lockups on my
> Ryzen 7700X machine.
> I see such messages in the log:
> 
> clocksource: Long readout interval, skipping watchdog check: cs_nsec:
> 2897344852 wd_nsec: 2897356996
> 
> Reverting d6e152d905bdb1f32f9d99775e2f453350399a6a for mainline fixes the
> issue for me.
> 

Hi maintainers,

Several users from CachyOS have reported this regression as well. We landed on
the same bisection. One of the users that could reproduce this reliably
reproduced this just by watching a YouTube video in a browser, and observed
freezes and stutters when interacting with the system.

I had an LLM generate a fix (patch attached), and it fixed the regression for
that user. Full disclosure: it is written completely by AI, and I am also not
familiar with this subsystem. I just hope that this patch can be helpful in
fixing the regression.

Please don't hesitate to tell me off if utilizing AI in this way is not
helpful, so I can keep this in mind for future contributions.


-- 
Regards,
  Eric

[-- Attachment #2: ai.patch --]
[-- Type: text/x-patch, Size: 1283 bytes --]

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 38570998a19b..37b10045572e 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -332,8 +332,10 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
 	if (delta > (int64_t)dev->min_delta_ns) {
 		delta = min(delta, (int64_t) dev->max_delta_ns);
 		clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
-		if (!dev->set_next_event((unsigned long) clc, dev))
+		if (!dev->set_next_event((unsigned long) clc, dev)) {
+			dev->next_event_forced = 0;
 			return 0;
+		}
 	}
 
 	if (dev->next_event_forced)
diff --git a/kernel/time/tick-oneshot.c b/kernel/time/tick-oneshot.c
index 7472597f3225..bf411472d4f7 100644
--- a/kernel/time/tick-oneshot.c
+++ b/kernel/time/tick-oneshot.c
@@ -34,6 +34,7 @@ int tick_program_event(ktime_t expires, int force)
 		 */
 		clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED);
 		dev->next_event = KTIME_MAX;
+		dev->next_event_forced = 0;
 		return 0;
 	}
 
@@ -43,6 +44,7 @@ int tick_program_event(ktime_t expires, int force)
 		 * before using it.
 		 */
 		clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
+		dev->next_event_forced = 0;
 	}
 
 	return clockevents_program_event(dev, expires, force);


* Re: [PATCH v2] cpufreq: Fix hotplug-suspend race during reboot
From: Zhongqiu Han @ 2026-04-14 14:44 UTC (permalink / raw)
  To: Tianxiang Chen, rafael; +Cc: viresh.kumar, lingyue, linux-pm, linux-kernel
In-Reply-To: <20260408141914.35281-1-nanmu@xiaomi.com>

On 4/8/2026 10:19 PM, Tianxiang Chen wrote:
> During system reboot, cpufreq_suspend() is called via the
> kernel_restart() -> device_shutdown() -> pm_notifier_call_chain()
> path. Unlike the normal system suspend path, the reboot path does not
> call freeze_processes(), so userspace processes and kernel threads
> remain active.
> 
> This allows CPU hotplug operations to run concurrently with
> cpufreq_suspend(). The original code has no synchronization with CPU
> hotplug, leading to a race condition where governor_data can be freed
> by the hotplug path while cpufreq_suspend() is still accessing it,
> resulting in a null pointer dereference:
> 
>    Unable to handle kernel NULL pointer dereference
>    Call Trace:
>     do_kernel_fault+0x28/0x3c
>     cpufreq_suspend+0xdc/0x160
>     device_shutdown+0x18/0x200
>     kernel_restart+0x40/0x80
>     arm64_sys_reboot+0x1b0/0x200
> 
> Fix this by adding cpus_read_lock()/cpus_read_unlock() to
> cpufreq_suspend() to block CPU hotplug operations while suspend is in
> progress.
> 
> Signed-off-by: Tianxiang Chen <nanmu@xiaomi.com>
> ---
> v2:
> - Update changelog to explicitly mention reboot scenario
> - Add observed crash trace
> ---
>   drivers/cpufreq/cpufreq.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 1f794524a1d9..6f1d264c378b 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -1979,6 +1979,7 @@ void cpufreq_suspend(void)
>   	if (!cpufreq_driver)
>   		return;
>   
> +	cpus_read_lock();
>   	if (!has_target() && !cpufreq_driver->suspend)
>   		goto suspend;
>   
> @@ -1998,6 +1999,7 @@ void cpufreq_suspend(void)
>   
>   suspend:
>   	cpufreq_suspended = true;
> +	cpus_read_unlock();
>   }
>   
>   /**

Hi Tianxiang,

May I ask whether you tested this with lockdep enabled? Specifically, does
the new cpus_read_lock() → policy->rwsem ordering in cpufreq_suspend()
trigger any lockdep warnings? Thanks
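
For context on what such a warning would mean: lockdep records the
order in which lock classes are taken and complains when a later path
takes them the other way round. A toy model of that bookkeeping is
sketched below. All order_* names are hypothetical, the real lockdep
is far more elaborate, and whether any existing path really takes
policy->rwsem before the hotplug lock is the open question, not
something this sketch asserts:

```c
#include <assert.h>
#include <stdbool.h>

#define ORDER_MAX_CLASSES 4

/* order_after[a][b]: class b was once acquired while a was held. */
static bool order_after[ORDER_MAX_CLASSES][ORDER_MAX_CLASSES];
static int order_held[ORDER_MAX_CLASSES];
static int order_nr_held;

/* True if recorded dependencies lead from class 'from' to class 'to'. */
static bool order_reachable(int from, int to, int depth)
{
	if (from == to)
		return true;
	if (depth >= ORDER_MAX_CLASSES)
		return false;
	for (int i = 0; i < ORDER_MAX_CLASSES; i++)
		if (order_after[from][i] && order_reachable(i, to, depth + 1))
			return true;
	return false;
}

/* Returns false if taking 'cls' now would invert a recorded order. */
static bool order_acquire(int cls)
{
	bool ok = true;

	assert(cls >= 0 && cls < ORDER_MAX_CLASSES);
	assert(order_nr_held < ORDER_MAX_CLASSES);

	for (int i = 0; i < order_nr_held; i++) {
		if (order_reachable(cls, order_held[i], 0))
			ok = false;	/* cls -> held is known: inversion */
		else
			order_after[order_held[i]][cls] = true;
	}
	order_held[order_nr_held++] = cls;
	return ok;
}

/* Releases must be nested in this simplified model. */
static void order_release(int cls)
{
	assert(order_nr_held > 0 && order_held[order_nr_held - 1] == cls);
	order_nr_held--;
}
```

If one path acquires class 1 and then class 0, a second path that
acquires class 0 and then class 1 makes order_acquire() return false,
which corresponds to the kind of report the question is about.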



-- 
Thx and BRs,
Zhongqiu Han


* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Pavlu @ 2026-04-14 14:33 UTC (permalink / raw)
  To: chensong_2000
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe, jikos, mbenes,
	pmladek, joe.lawrence, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260413080701.180976-1-chensong_2000@189.cn>

On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
> From: Song Chen <chensong_2000@189.cn>
> 
> ftrace and livepatch currently have their module load/unload callbacks
> hard-coded in the module loader as direct function calls to
> ftrace_module_enable(), klp_module_coming(), klp_module_going()
> and ftrace_release_mod(). This tight coupling was originally introduced
> to enforce strict call ordering that could not be guaranteed by the
> module notifier chain, which only supported forward traversal. Their
> notifiers were moved in and out back and forth. see [1] and [2].

I'm unclear about what is meant by the notifiers being moved back and
forth. The links point to patches that converted ftrace+klp from using
module notifiers to explicit callbacks due to ordering issues, but this
switch occurred only once. Have there been other attempts to use
notifiers again?

> 
> Now that the notifier chain supports reverse traversal via
> blocking_notifier_call_chain_reverse(), the ordering can be enforced
> purely through notifier priority. As a result, the module loader is now
> decoupled from the implementation details of ftrace and livepatch.
> What's more, adding a new subsystem with symmetric setup/teardown ordering
> requirements during module load/unload no longer requires modifying
> kernel/module/main.c; it only needs to register a notifier_block with an
> appropriate priority.
> 
> [1]:https://lore.kernel.org/all/
> 	alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/
> [2]:https://lore.kernel.org/all/
> 	20160301030034.GC12120@packer-debian-8-amd64.digitalocean.com/

Nit: Avoid wrapping URLs, as it breaks autolinking and makes the links
harder to copy.

Better links would be:
[1] https://lore.kernel.org/all/1455661953-15838-1-git-send-email-jeyu@redhat.com/
[2] https://lore.kernel.org/all/1458176139-17455-1-git-send-email-jeyu@redhat.com/

The first link is the final version of what landed as commit
7dcd182bec27 ("ftrace/module: remove ftrace module notifier"). The
second is commit 7e545d6eca20 ("livepatch/module: remove livepatch
module notifier").

> 
> Signed-off-by: Song Chen <chensong_2000@189.cn>
> ---
>  include/linux/module.h  |  8 ++++++++
>  kernel/livepatch/core.c | 29 ++++++++++++++++++++++++++++-
>  kernel/module/main.c    | 34 +++++++++++++++-------------------
>  kernel/trace/ftrace.c   | 38 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 89 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 14f391b186c6..0bdd56f9defd 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -308,6 +308,14 @@ enum module_state {
>  	MODULE_STATE_COMING,	/* Full formed, running module_init. */
>  	MODULE_STATE_GOING,	/* Going away. */
>  	MODULE_STATE_UNFORMED,	/* Still setting it up. */
> +	MODULE_STATE_FORMED,

I don't see a reason to add a new module state. Why is it necessary and
how does it fit with the existing states?

> +};
> +
> +enum module_notifier_prio {
> +	MODULE_NOTIFIER_PRIO_LOW = INT_MIN,	/* Low prioroty, coming last, going first */
> +	MODULE_NOTIFIER_PRIO_MID = 0,	/* Normal priority. */
> +	MODULE_NOTIFIER_PRIO_SECOND_HIGH = INT_MAX - 1,	/* Second high priorigy, coming second*/
> +	MODULE_NOTIFIER_PRIO_HIGH = INT_MAX,	/* High priorigy, coming first, going late. */

I suggest being explicit about how the notifiers are ordered. For
example:

enum module_notifier_prio {
	MODULE_NOTIFIER_PRIO_NORMAL,	/* Normal priority, coming last, going first. */
	MODULE_NOTIFIER_PRIO_LIVEPATCH,
	MODULE_NOTIFIER_PRIO_FTRACE,	/* High priority, coming first, going late. */
};
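
To illustrate the ordering these priorities are meant to give, here is
a small userspace model of a priority-sorted chain walked forwards for
"coming" and backwards for "going". The demo_* names are hypothetical
and the reverse walk is only a stand-in; it makes no claim about how
blocking_notifier_call_chain_reverse() from patch 1/2 is implemented:

```c
#include <assert.h>
#include <stddef.h>

enum demo_prio {
	DEMO_PRIO_NORMAL,	/* coming last, going first */
	DEMO_PRIO_LIVEPATCH,
	DEMO_PRIO_FTRACE,	/* coming first, going last */
};

struct demo_notifier {
	int priority;
	struct demo_notifier *next;
};

static struct demo_notifier *demo_chain;

#define DEMO_LOG_MAX 8
static const struct demo_notifier *demo_log[DEMO_LOG_MAX];
static size_t demo_log_len;

/* Insert while keeping the list sorted by descending priority. */
static void demo_register(struct demo_notifier *nb)
{
	struct demo_notifier **p = &demo_chain;

	while (*p && (*p)->priority >= nb->priority)
		p = &(*p)->next;
	nb->next = *p;
	*p = nb;
}

/* Stand-in for invoking a callback: just record who was called. */
static void demo_call(const struct demo_notifier *nb)
{
	assert(demo_log_len < DEMO_LOG_MAX);
	demo_log[demo_log_len++] = nb;
}

/* Forward walk: highest priority first (the "coming" direction). */
static void demo_call_chain(void)
{
	for (const struct demo_notifier *nb = demo_chain; nb; nb = nb->next)
		demo_call(nb);
}

/* Reverse walk: lowest priority first (the "going" direction). */
static void demo_call_chain_reverse(const struct demo_notifier *nb)
{
	if (!nb)
		return;
	demo_call_chain_reverse(nb->next);
	demo_call(nb);
}
```

Registering a normal, an ftrace and a livepatch notifier in any order
gives ftrace, livepatch, normal on the forward walk and the exact
opposite on the reverse walk, without anyone needing to remember what
INT_MAX - 1 was supposed to mean.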

>  };
>  
>  struct mod_tree_node {
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index 28d15ba58a26..ce78bb23e24b 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -1375,13 +1375,40 @@ void *klp_find_section_by_name(const struct module *mod, const char *name,
>  }
>  EXPORT_SYMBOL_GPL(klp_find_section_by_name);
>  
> +static int klp_module_callback(struct notifier_block *nb, unsigned long op,
> +			void *module)
> +{
> +	struct module *mod = module;
> +	int err = 0;
> +
> +	switch (op) {
> +	case MODULE_STATE_COMING:
> +		err = klp_module_coming(mod);
> +		break;
> +	case MODULE_STATE_LIVE:
> +		break;
> +	case MODULE_STATE_GOING:
> +		klp_module_going(mod);
> +		break;
> +	default:
> +		break;
> +	}

klp_module_coming() and klp_module_going() are now used only in
kernel/livepatch/core.c where they are also defined. This means the
functions can be static and their declarations removed from
include/linux/livepatch.h.

Nit: The MODULE_STATE_LIVE and default cases in the switch can be
removed.

> +
> +	return notifier_from_errno(err);
> +}
> +
> +static struct notifier_block klp_module_nb = {
> +	.notifier_call = klp_module_callback,
> +	.priority = MODULE_NOTIFIER_PRIO_SECOND_HIGH
> +};
> +
>  static int __init klp_init(void)
>  {
>  	klp_root_kobj = kobject_create_and_add("livepatch", kernel_kobj);
>  	if (!klp_root_kobj)
>  		return -ENOMEM;
>  
> -	return 0;
> +	return register_module_notifier(&klp_module_nb);
>  }
>  
>  module_init(klp_init);
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index c3ce106c70af..226dd5b80997 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -833,10 +833,8 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
>  	/* Final destruction now no one is using it. */
>  	if (mod->exit != NULL)
>  		mod->exit();
> -	blocking_notifier_call_chain(&module_notify_list,
> +	blocking_notifier_call_chain_reverse(&module_notify_list,
>  				     MODULE_STATE_GOING, mod);
> -	klp_module_going(mod);
> -	ftrace_release_mod(mod);
>  
>  	async_synchronize_full();
>  
> @@ -3135,10 +3133,8 @@ static noinline int do_init_module(struct module *mod)
>  	mod->state = MODULE_STATE_GOING;
>  	synchronize_rcu();
>  	module_put(mod);
> -	blocking_notifier_call_chain(&module_notify_list,
> +	blocking_notifier_call_chain_reverse(&module_notify_list,
>  				     MODULE_STATE_GOING, mod);
> -	klp_module_going(mod);
> -	ftrace_release_mod(mod);
>  	free_module(mod);
>  	wake_up_all(&module_wq);
>  

The patch unexpectedly leaves a call to ftrace_free_mem() in
do_init_module().

> @@ -3281,20 +3277,14 @@ static int complete_formation(struct module *mod, struct load_info *info)
>  	return err;
>  }
>  
> -static int prepare_coming_module(struct module *mod)
> +static int prepare_module_state_transaction(struct module *mod,
> +			unsigned long val_up, unsigned long val_down)
>  {
>  	int err;
>  
> -	ftrace_module_enable(mod);
> -	err = klp_module_coming(mod);
> -	if (err)
> -		return err;
> -
>  	err = blocking_notifier_call_chain_robust(&module_notify_list,
> -			MODULE_STATE_COMING, MODULE_STATE_GOING, mod);
> +			val_up, val_down, mod);
>  	err = notifier_to_errno(err);
> -	if (err)
> -		klp_module_going(mod);
>  
>  	return err;
>  }
> @@ -3468,14 +3458,21 @@ static int load_module(struct load_info *info, const char __user *uargs,
>  	init_build_id(mod, info);
>  
>  	/* Ftrace init must be called in the MODULE_STATE_UNFORMED state */
> -	ftrace_module_init(mod);
> +	err = prepare_module_state_transaction(mod,
> +				MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);

I believe val_down should be MODULE_STATE_GOING to reverse the
operation. Why is the new state MODULE_STATE_FORMED needed here?

> +	if (err)
> +		goto ddebug_cleanup;
>  
>  	/* Finally it's fully formed, ready to start executing. */
>  	err = complete_formation(mod, info);
> -	if (err)
> +	if (err) {
> +		blocking_notifier_call_chain_reverse(&module_notify_list,
> +				MODULE_STATE_FORMED, mod);
>  		goto ddebug_cleanup;
> +	}
>  
> -	err = prepare_coming_module(mod);
> +	err = prepare_module_state_transaction(mod,
> +				MODULE_STATE_COMING, MODULE_STATE_GOING);
>  	if (err)
>  		goto bug_cleanup;
>  
> @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
>  	destroy_params(mod->kp, mod->num_kp);
>  	blocking_notifier_call_chain(&module_notify_list,
>  				     MODULE_STATE_GOING, mod);

My understanding is that all notifier chains for MODULE_STATE_GOING
should be reversed.

> -	klp_module_going(mod);
>   bug_cleanup:
>  	mod->state = MODULE_STATE_GOING;
>  	/* module_bug_cleanup needs module_mutex protection */

The patch removes the klp_module_going() cleanup call in load_module().
Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
should be removed and appropriately replaced with a cleanup via
a notifier.

> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 8df69e702706..efedb98d3db4 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -5241,6 +5241,44 @@ static int __init ftrace_mod_cmd_init(void)
>  }
>  core_initcall(ftrace_mod_cmd_init);
>  
> +static int ftrace_module_callback(struct notifier_block *nb, unsigned long op,
> +			void *module)
> +{
> +	struct module *mod = module;
> +
> +	switch (op) {
> +	case MODULE_STATE_UNFORMED:
> +		ftrace_module_init(mod);
> +		break;
> +	case MODULE_STATE_COMING:
> +		ftrace_module_enable(mod);
> +		break;
> +	case MODULE_STATE_LIVE:
> +		ftrace_free_mem(mod, mod->mem[MOD_INIT_TEXT].base,
> +				mod->mem[MOD_INIT_TEXT].base + mod->mem[MOD_INIT_TEXT].size);
> +		break;
> +	case MODULE_STATE_GOING:
> +	case MODULE_STATE_FORMED:
> +		ftrace_release_mod(mod);
> +		break;
> +	default:
> +		break;
> +	}

ftrace_module_init(), ftrace_module_enable(), ftrace_free_mem() and
ftrace_release_mod() should now be used only in kernel/trace/ftrace.c,
where they are also defined. The functions can then be made static and
removed from include/linux/ftrace.h.

Nit: The default case in the switch can be removed.

> +
> +	return notifier_from_errno(0);

Nit: This can be simply "return NOTIFY_OK;".

> +}
> +
> +static struct notifier_block ftrace_module_nb = {
> +	.notifier_call = ftrace_module_callback,
> +	.priority = MODULE_NOTIFIER_PRIO_HIGH
> +};
> +
> +static int __init ftrace_register_module_notifier(void)
> +{
> +	return register_module_notifier(&ftrace_module_nb);
> +}
> +core_initcall(ftrace_register_module_notifier);
> +
>  static void function_trace_probe_call(unsigned long ip, unsigned long parent_ip,
>  				      struct ftrace_ops *op, struct ftrace_regs *fregs)
>  {

-- 
Thanks,
Petr


* [PATCH] PM: hibernate: preserve uswsusp swap pin across SNAPSHOT_SET_SWAP_AREA re-set failures
From: DaeMyung Kang @ 2026-04-14 14:32 UTC (permalink / raw)
  To: Andrew Morton, Rafael J . Wysocki
  Cc: Youngjun Park, Kairui Song, Chris Li, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Len Brown, Pavel Machek, linux-mm,
	linux-pm, linux-kernel, DaeMyung Kang, stable

Commit 5b2b0c6e4577 ("mm/swap, PM: hibernate: fix swapoff race in uswsusp
by pinning swap device") introduced SWP_HIBERNATION so that the swap
device chosen via /dev/snapshot is held against swapoff for the entire
uswsusp session. The intended invariant is: from the first successful
SNAPSHOT_SET_SWAP_AREA until the /dev/snapshot fd is closed, exactly one
swap device is pinned.

snapshot_set_swap_area() breaks that invariant on the re-set path:

	unpin_hibernation_swap_type(data->swap);
	data->swap = pin_hibernation_swap_type(swdev, offset);
	if (data->swap < 0)
		return swdev ? -ENODEV : -EINVAL;

The unpin happens unconditionally before the new pin is attempted. If
the new pin fails (e.g. user space supplies an offset/device that is not
an active swap area), the session continues with no swap device pinned,
reopening exactly the swapoff race the original commit was meant to
close. A subsequent swapoff on the previously selected device now
succeeds where it would have been blocked with EBUSY.

As a secondary consequence, data->swap is overwritten with the negative
error return from pin_hibernation_swap_type(). The value is harmless at
close time (swap_type_to_info() on the invalid type returns NULL, so the
release-side unpin is a no-op and there is no pin to leak), but leaving
a negative sentinel in data->swap for the rest of the session is still
a state-hygiene defect: any future reader of data->swap cannot
distinguish it from a never-set session.

The bug is observable with ioctls alone; it does not require an actual
hibernation cycle. A user-space caller that supplies one valid and then
one invalid resume_swap_area is enough to strand the session without a
pin.

Reordering pin/unpin in the caller cannot fix this cleanly. Each of
pin_hibernation_swap_type() / unpin_hibernation_swap_type() acquires
swap_lock independently, so any two-call sequence leaves a window in
which swapoff can observe an inconsistent pin state. The same-area
re-set case (type == old_type) also cannot be expressed with pin+unpin
without either toggling the bit (racy) or returning EBUSY (a false
error).

Introduce repin_hibernation_swap_type(), which performs the transition
atomically under a single swap_lock acquisition:

  - verify that old_type, if held, still carries SWP_HIBERNATION;
  - look up the new swap area;
  - if it is the same as old_type, return without touching any flags;
  - otherwise clear SWP_HIBERNATION on the old si and set it on the
    new si within the same critical section;
  - on any failure, return without modifying either si's flags, so the
    previous pin is preserved.

Update snapshot_set_swap_area() to use the new helper and to stage the
result in a local variable, committing to data->swap only on success.
This closes the protection-loss window and also avoids the data->swap
corruption on failure.
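
The transition described above can be modelled in userspace roughly as
follows. This is an illustrative sketch, not the code in the diff: the
repin_* names are hypothetical, a counter-backed flag stands in for
swap_lock, and assert() stands in for the WARN_ON_ONCE() invariant
check:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define REPIN_MAX_SWAP 4

/* Hypothetical stand-in for one swap area; not swap_info_struct. */
struct repin_swap {
	bool active;	/* area is swapon'ed */
	bool pinned;	/* models SWP_HIBERNATION */
	int dev;	/* models the device number */
};

static struct repin_swap repin_swaps[REPIN_MAX_SWAP];
static bool repin_locked;		/* models swap_lock being held */
static unsigned int repin_lock_count;	/* number of acquisitions */

static void repin_lock(void)
{
	assert(!repin_locked);
	repin_locked = true;
	repin_lock_count++;
}

static void repin_unlock(void)
{
	assert(repin_locked);
	repin_locked = false;
}

/* Move the pin from old_type to the area matching dev in one section. */
static int repin_model(int old_type, int dev)
{
	int new_type = -ENODEV;

	repin_lock();

	/* Invariant: a held old_type must still carry the pin. */
	if (old_type >= 0) {
		assert(old_type < REPIN_MAX_SWAP);
		assert(repin_swaps[old_type].pinned);
	}

	for (int i = 0; i < REPIN_MAX_SWAP; i++) {
		if (repin_swaps[i].active && repin_swaps[i].dev == dev) {
			new_type = i;
			break;
		}
	}
	if (new_type < 0)
		goto out;	/* failure: no flags touched */
	if (new_type == old_type)
		goto out;	/* same area: nothing to change */

	if (old_type >= 0)
		repin_swaps[old_type].pinned = false;
	repin_swaps[new_type].pinned = true;
out:
	repin_unlock();
	return new_type;
}

/* Caller side: stage the result, commit to *cur only on success. */
static int repin_set_swap_area(int *cur, int dev)
{
	int type = repin_model(*cur, dev);

	if (type < 0)
		return type;
	*cur = type;
	return 0;
}
```

Each call takes the modelled lock exactly once, and a failed lookup
returns with both the old pin and the caller's stored type unchanged.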

Fixes: 5b2b0c6e4577 ("mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device")
Cc: stable@vger.kernel.org
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
---
Notes (not part of the commit, stripped by git am):

Baseline
--------
This patch is generated against linux-next at commit 5b2b0c6e4577
("mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap
device"). Mainline does not yet carry that commit, and neither the
helpers it introduces (pin/unpin_hibernation_swap_type) nor the code
site this patch modifies exist there. The base-commit trailer at the
bottom of the mbox records the exact commit.

Testing
-------
The bug does not require an actual hibernation cycle. The ioctl path
alone is enough to re-open the swapoff race. A targeted reproducer is
included below; run it as root in a throwaway VM with two active swap
block devices and one non-swap block device (three arguments).

Run inside a VM on linux-next at 5b2b0c6e4577 with this patch applied:

  step1: pinned active swap /dev/vda
  step2: swapoff blocked with EBUSY while pin is held
  step3: repinned active swap to /dev/vdb
  step4: swapoff(/dev/vda) succeeded after repinning away
  step5: repinned swap is blocked with EBUSY
  step6: bogus SNAPSHOT_SET_SWAP_AREA failed as expected: No such device
  step7: swapoff(/dev/vdb) is still blocked with EBUSY
  result: FIXED kernel, hibernation pin was preserved
  step8: swapoff succeeded after closing /dev/snapshot

Run on the same tree without this patch applied: step7 instead reports
"swapoff(/dev/vdb) succeeded after failed re-set" and the program exits
with status 1 ("BUGGY kernel, hibernation pin was dropped").

What the reproducer covers:
  - SWP_HIBERNATION is actually enforced against swapoff (step2, step5);
  - the success path of repin_hibernation_swap_type() atomically moves
    the pin from one active swap to another (step3, step4, step5);
  - the failure path of repin_hibernation_swap_type() preserves the
    existing pin (step6, step7);
  - the pin lifetime ends on /dev/snapshot close (step8).

What it does not cover:
  - snapshot_open(O_RDONLY) initial resume-device pin path;
  - the full suspend-to-disk image create/restore flow;
  - concurrent swapoff racing against SNAPSHOT_SET_SWAP_AREA;
  - the type == old_type idempotent branch (not externally observable).

A normal sysfs-based suspend-to-disk cycle continues to work; the
find_hibernation_swap_type() path is unchanged. Build tested with
allmodconfig and run-tested with CONFIG_PROVE_LOCKING=y and
CONFIG_KASAN=y. The VM was booted with oops=panic panic=-1 so any
WARN/Oops/BUG would have halted the run; the full test completed
cleanly with no kernel log diagnostics, including the three
WARN_ON_ONCE() invariant checks inside repin_hibernation_swap_type().

Reproducer (C source, for reference only -- not added to the tree):

 // SPDX-License-Identifier: GPL-2.0
 /*
  * Reproduce the uswsusp SNAPSHOT_SET_SWAP_AREA pin lifetime regression.
  *
  * This targets the bug introduced after hibernation swap pinning was added:
  * a failed SNAPSHOT_SET_SWAP_AREA() could drop the existing pin, letting a
  * subsequent swapoff() succeed while /dev/snapshot was still open.
  *
  * Run only inside a throwaway VM. The test manipulates swap state and leaves
  * the target swap area disabled on success.
  */
 
 #define _GNU_SOURCE
 #include <errno.h>
 #include <fcntl.h>
 #include <linux/types.h>
 #include <linux/suspend_ioctls.h>
 #include <stdbool.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/ioctl.h>
 #include <sys/stat.h>
 #include <sys/swap.h>
 #include <sys/sysmacros.h>
 #include <unistd.h>
 
 static void print_usage(const char *prog)
 {
 	fprintf(stderr,
 		"usage: %s <active-swap-dev-1> <active-swap-dev-2> <bogus-block-dev>\n"
 		"  <active-swap-dev-1> must be an active swap block device.\n"
 		"  <active-swap-dev-2> must be a second active swap block device.\n"
 		"  <bogus-block-dev> must be a block device that is not a swap area.\n",
 		prog);
 }
 
 static int encode_dev(dev_t dev)
 {
 	unsigned int major_num = major(dev);
 	unsigned int minor_num = minor(dev);
 
 	/*
 	 * Match the kernel's new_encode_dev() layout; SNAPSHOT_SET_SWAP_AREA
 	 * decodes this with new_decode_dev() on the kernel side.
 	 */
 	return (major_num & 0xfff) << 8 |
 	       (minor_num & 0xff) |
 	       ((minor_num & ~0xff) << 12);
 }
 
 static int get_block_dev(const char *path, dev_t *dev)
 {
 	struct stat st;
 
 	if (stat(path, &st) < 0) {
 		fprintf(stderr, "stat(%s): %s\n", path, strerror(errno));
 		return -errno;
 	}
 
 	if (!S_ISBLK(st.st_mode)) {
 		fprintf(stderr, "%s is not a block device\n", path);
 		return -EINVAL;
 	}
 
 	*dev = st.st_rdev;
 	return 0;
 }
 
 static int snapshot_set_swap_area(int fd, dev_t dev, long long offset)
 {
 	struct resume_swap_area area = {
 		.offset = offset,
 		.dev = encode_dev(dev),
 	};
 
 	if (ioctl(fd, SNAPSHOT_SET_SWAP_AREA, &area) < 0)
 		return -errno;
 	return 0;
 }
 
 int main(int argc, char **argv)
 {
 	const char *swap_path_1, *swap_path_2, *bogus_path;
 	dev_t swap_dev_1, swap_dev_2, bogus_dev;
 	int fd, ret;
 	bool buggy = false;
 
 	if (argc != 4) {
 		print_usage(argv[0]);
 		return 2;
 	}
 
 	if (geteuid() != 0) {
 		fprintf(stderr, "must run as root\n");
 		return 2;
 	}
 
 	swap_path_1 = argv[1];
 	swap_path_2 = argv[2];
 	bogus_path = argv[3];
 
 	ret = get_block_dev(swap_path_1, &swap_dev_1);
 	if (ret < 0)
 		return 2;
 
 	ret = get_block_dev(swap_path_2, &swap_dev_2);
 	if (ret < 0)
 		return 2;
 
 	ret = get_block_dev(bogus_path, &bogus_dev);
 	if (ret < 0)
 		return 2;
 
 	fd = open("/dev/snapshot", O_WRONLY);
 	if (fd < 0) {
 		fprintf(stderr, "open(/dev/snapshot): %s\n", strerror(errno));
 		return 2;
 	}
 
 	ret = snapshot_set_swap_area(fd, swap_dev_1, 0);
 	if (ret < 0) {
 		fprintf(stderr, "step1: valid SNAPSHOT_SET_SWAP_AREA failed: %s\n",
 			strerror(-ret));
 		close(fd);
 		return 2;
 	}
 	printf("step1: pinned active swap %s\n", swap_path_1);
 
 	if (swapoff(swap_path_1) == 0) {
 		fprintf(stderr,
 			"step2: swapoff(%s) unexpectedly succeeded while pinned\n",
 			swap_path_1);
 		close(fd);
 		return 1;
 	}
 	if (errno != EBUSY) {
 		fprintf(stderr,
 			"step2: swapoff(%s) failed with %s, expected EBUSY\n",
 			swap_path_1, strerror(errno));
 		close(fd);
 		return 2;
 	}
 	printf("step2: swapoff blocked with EBUSY while pin is held\n");
 
 	ret = snapshot_set_swap_area(fd, swap_dev_2, 0);
 	if (ret < 0) {
 		fprintf(stderr,
 			"step3: second valid SNAPSHOT_SET_SWAP_AREA failed: %s\n",
 			strerror(-ret));
 		close(fd);
 		return 2;
 	}
 	printf("step3: repinned active swap to %s\n", swap_path_2);
 
 	if (swapoff(swap_path_1) < 0) {
 		fprintf(stderr,
 			"step4: swapoff(%s) failed after repin: %s\n",
 			swap_path_1, strerror(errno));
 		close(fd);
 		return 2;
 	}
 	printf("step4: swapoff(%s) succeeded after repinning away\n",
 	       swap_path_1);
 
 	if (swapoff(swap_path_2) == 0) {
 		fprintf(stderr,
 			"step5: swapoff(%s) unexpectedly succeeded while pinned\n",
 			swap_path_2);
 		close(fd);
 		return 1;
 	}
 	if (errno != EBUSY) {
 		fprintf(stderr,
 			"step5: swapoff(%s) failed with %s, expected EBUSY\n",
 			swap_path_2, strerror(errno));
 		close(fd);
 		return 2;
 	}
 	printf("step5: repinned swap is blocked with EBUSY\n");
 
 	ret = snapshot_set_swap_area(fd, bogus_dev, 0);
 	if (!ret) {
 		fprintf(stderr,
 			"step6: bogus SNAPSHOT_SET_SWAP_AREA unexpectedly succeeded\n");
 		close(fd);
 		return 2;
 	}
 	printf("step6: bogus SNAPSHOT_SET_SWAP_AREA failed as expected: %s\n",
 	       strerror(-ret));
 
 	if (swapoff(swap_path_2) == 0) {
 		printf("step7: swapoff(%s) succeeded after failed re-set\n",
 		       swap_path_2);
 		printf("result: BUGGY kernel, hibernation pin was dropped\n");
 		buggy = true;
 	} else if (errno == EBUSY) {
 		printf("step7: swapoff(%s) is still blocked with EBUSY\n",
 		       swap_path_2);
 		printf("result: FIXED kernel, hibernation pin was preserved\n");
 	} else {
 		fprintf(stderr, "step7: unexpected swapoff(%s) error: %s\n",
 			swap_path_2, strerror(errno));
 		close(fd);
 		return 2;
 	}
 
 	close(fd);
 
 	if (!buggy) {
 		if (swapoff(swap_path_2) < 0) {
 			fprintf(stderr,
 				"step8: swapoff(%s) after close failed: %s\n",
 				swap_path_2, strerror(errno));
 			return 2;
 		}
 		printf("step8: swapoff succeeded after closing /dev/snapshot\n");
 	}
 
 	printf("note: re-enable swap with `swapon %s` and `swapon %s`\n",
 	       swap_path_1, swap_path_2);
 	return buggy ? 1 : 0;
 }


 include/linux/swap.h |  1 +
 kernel/power/user.c  | 12 +++------
 mm/swapfile.c        | 61 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1930f81e6be4..720347ae8ce1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -435,6 +435,7 @@ static inline long get_nr_swap_pages(void)
 
 extern void si_swapinfo(struct sysinfo *);
 extern int pin_hibernation_swap_type(dev_t device, sector_t offset);
+extern int repin_hibernation_swap_type(int old_type, dev_t device, sector_t offset);
 extern void unpin_hibernation_swap_type(int type);
 extern int find_hibernation_swap_type(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 4406f5644a56..869371ad4a5f 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -218,6 +218,7 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
 {
 	sector_t offset;
 	dev_t swdev;
+	int swap;
 
 	if (swsusp_swap_in_use())
 		return -EPERM;
@@ -238,19 +239,14 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
 		offset = swap_area.offset;
 	}
 
-	/*
-	 * Unpin the swap device if a swap area was already
-	 * set by SNAPSHOT_SET_SWAP_AREA.
-	 */
-	unpin_hibernation_swap_type(data->swap);
-
 	/*
 	 * User space encodes device types as two-byte values,
 	 * so we need to recode them
 	 */
-	data->swap = pin_hibernation_swap_type(swdev, offset);
-	if (data->swap < 0)
+	swap = repin_hibernation_swap_type(data->swap, swdev, offset);
+	if (swap < 0)
 		return swdev ? -ENODEV : -EINVAL;
+	data->swap = swap;
 	data->dev = swdev;
 	return 0;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c5b459a18f43..4d3b41125e6a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2215,6 +2215,67 @@ int pin_hibernation_swap_type(dev_t device, sector_t offset)
 	return type;
 }
 
+/**
+ * repin_hibernation_swap_type - Retarget a hibernation pin without dropping it
+ * @old_type: Currently pinned swap type, or a negative value if none is pinned
+ * @device: Block device containing the resume image
+ * @offset: Offset identifying the swap area
+ *
+ * Locate the swap device for @device/@offset and make it the hibernation-pinned
+ * device. If @old_type already refers to the same swap area, the existing pin
+ * is kept. On failure, the previous pin is preserved.
+ *
+ * Return:
+ * >= 0 on success (new swap type).
+ * -EINVAL if @device is invalid or @old_type is not a pinned swap type.
+ * -ENODEV if the swap device is not found.
+ * -EBUSY if the target swap area is already pinned by another user.
+ */
+int repin_hibernation_swap_type(int old_type, dev_t device, sector_t offset)
+{
+	int type;
+	struct swap_info_struct *old_si = NULL, *new_si;
+
+	spin_lock(&swap_lock);
+
+	if (old_type >= 0) {
+		old_si = swap_type_to_info(old_type);
+		if (WARN_ON_ONCE(!old_si || !(old_si->flags & SWP_HIBERNATION))) {
+			spin_unlock(&swap_lock);
+			return -EINVAL;
+		}
+	}
+
+	type = __find_hibernation_swap_type(device, offset);
+	if (type < 0) {
+		spin_unlock(&swap_lock);
+		return type;
+	}
+
+	if (type == old_type) {
+		spin_unlock(&swap_lock);
+		return type;
+	}
+
+	new_si = swap_type_to_info(type);
+	if (WARN_ON_ONCE(!new_si)) {
+		spin_unlock(&swap_lock);
+		return -ENODEV;
+	}
+
+	if (WARN_ON_ONCE(new_si->flags & SWP_HIBERNATION)) {
+		spin_unlock(&swap_lock);
+		return -EBUSY;
+	}
+
+	if (old_si)
+		old_si->flags &= ~SWP_HIBERNATION;
+	new_si->flags |= SWP_HIBERNATION;
+
+	spin_unlock(&swap_lock);
+	return type;
+}
+
 /**
  * unpin_hibernation_swap_type - Unpin the swap device for hibernation
  * @type: Swap type previously returned by pin_hibernation_swap_type()

base-commit: 5b2b0c6e457765adbe96fb2d464ff1bcd3d72158
-- 
2.43.0



* Re: [patch V2 11/11] alarmtimer: Remove unused interfaces
From: Frederic Weisbecker @ 2026-04-14 14:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, John Stultz, Stephen Boyd, Calvin Owens, Anna-Maria Behnsen,
	Peter Zijlstra (Intel), Alexander Viro, Christian Brauner,
	Jan Kara, linux-fsdevel, Sebastian Reichel, linux-pm,
	Pablo Neira Ayuso, Florian Westphal, Phil Sutter, netfilter-devel,
	coreteam
In-Reply-To: <20260408114952.670899355@kernel.org>

On Wed, Apr 08, 2026 at 01:54:33PM +0200, Thomas Gleixner wrote:
> All alarmtimer users are converted to alarm_start_timer(). Remove the now
> unused interfaces.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: John Stultz <jstultz@google.com>
> Cc: Stephen Boyd <sboyd@kernel.org>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

-- 
Frederic Weisbecker
SUSE Labs


