* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Hanabishi @ 2026-04-14 21:35 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Frederic Weisbecker, Eric Naim, LKML, Calvin Owens,
Peter Zijlstra, Anna-Maria Behnsen, Ingo Molnar, John Stultz,
Stephen Boyd, Alexander Viro, Christian Brauner, Jan Kara,
linux-fsdevel, Sebastian Reichel, linux-pm, Pablo Neira Ayuso,
Florian Westphal, Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <87340xfeje.ffs@tglx>
On 14/04/2026 20:55, Thomas Gleixner wrote:
> The one below should cover all possible holes.
>
> Thanks,
>
> tglx
> ---
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index b4d730604972..5e22697b098d 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -94,6 +94,9 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
> if (dev->features & CLOCK_EVT_FEAT_DUMMY)
> return 0;
>
> + /* On state transitions clear the forced flag unconditionally */
> + dev->next_event_forced = 0;
> +
> /* Transition with new state-specific callbacks */
> switch (state) {
> case CLOCK_EVT_STATE_DETACHED:
> @@ -366,8 +369,10 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, b
> if (delta > (int64_t)dev->min_delta_ns) {
> delta = min(delta, (int64_t) dev->max_delta_ns);
> cycles = ((u64)delta * dev->mult) >> dev->shift;
> - if (!dev->set_next_event((unsigned long) cycles, dev))
> + if (!dev->set_next_event((unsigned long) cycles, dev)) {
> + dev->next_event_forced = 0;
> return 0;
> + }
> }
>
> if (dev->next_event_forced)
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index 7e57fa31ee26..115e0bf01276 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -108,6 +108,7 @@ static struct clock_event_device *tick_get_oneshot_wakeup_device(int cpu)
>
> static void tick_oneshot_wakeup_handler(struct clock_event_device *wd)
> {
> + wd->next_event_forced = 0;
> /*
> * If we woke up early and the tick was reprogrammed in the
> * meantime then this may be spurious but harmless.
Ok, it does fix the problem! Thank you.
The patch itself does not apply cleanly to 7.0 though, and I had to adapt it a bit.
* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Thomas Gleixner @ 2026-04-14 20:55 UTC (permalink / raw)
To: Hanabishi, Frederic Weisbecker
Cc: Eric Naim, LKML, Calvin Owens, Peter Zijlstra, Anna-Maria Behnsen,
Ingo Molnar, John Stultz, Stephen Boyd, Alexander Viro,
Christian Brauner, Jan Kara, linux-fsdevel, Sebastian Reichel,
linux-pm, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
netfilter-devel, coreteam
In-Reply-To: <a3ac856c-914c-4b39-949f-634bed501e7c@gmail.com>
On Tue, Apr 14 2026 at 18:25, Hanabishi wrote:
> On 14/04/2026 18:04, Frederic Weisbecker wrote:
>
> This patch doesn't help me unfortunately. Thanks.
The one below should cover all possible holes.
Thanks,
tglx
---
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index b4d730604972..5e22697b098d 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -94,6 +94,9 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
 	if (dev->features & CLOCK_EVT_FEAT_DUMMY)
 		return 0;

+	/* On state transitions clear the forced flag unconditionally */
+	dev->next_event_forced = 0;
+
 	/* Transition with new state-specific callbacks */
 	switch (state) {
 	case CLOCK_EVT_STATE_DETACHED:
@@ -366,8 +369,10 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, b
 	if (delta > (int64_t)dev->min_delta_ns) {
 		delta = min(delta, (int64_t) dev->max_delta_ns);
 		cycles = ((u64)delta * dev->mult) >> dev->shift;
-		if (!dev->set_next_event((unsigned long) cycles, dev))
+		if (!dev->set_next_event((unsigned long) cycles, dev)) {
+			dev->next_event_forced = 0;
 			return 0;
+		}
 	}

 	if (dev->next_event_forced)
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 7e57fa31ee26..115e0bf01276 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -108,6 +108,7 @@ static struct clock_event_device *tick_get_oneshot_wakeup_device(int cpu)

 static void tick_oneshot_wakeup_handler(struct clock_event_device *wd)
 {
+	wd->next_event_forced = 0;
 	/*
 	 * If we woke up early and the tick was reprogrammed in the
 	 * meantime then this may be spurious but harmless.
* Re: [PATCH 3/3] pmdomain: qcom: rpmhpd: Add power domains for Nord SoC
From: Dmitry Baryshkov @ 2026-04-14 19:27 UTC (permalink / raw)
To: Shawn Guo
Cc: Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Bjorn Andersson, Konrad Dybcio, Kamal Wadhwa, Taniya Das,
Bartosz Golaszewski, Deepti Jaggi, linux-arm-msm, linux-pm,
devicetree, linux-kernel
In-Reply-To: <20260414035909.652992-4-shengchao.guo@oss.qualcomm.com>
On Tue, Apr 14, 2026 at 11:59:09AM +0800, Shawn Guo wrote:
> From: Kamal Wadhwa <kamal.wadhwa@oss.qualcomm.com>
>
> Add RPMh power domains required for Nord SoC. This includes
> new definitions for power domains supplying GFX1 and NSP3 subsystem.
>
> Co-developed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> Signed-off-by: Kamal Wadhwa <kamal.wadhwa@oss.qualcomm.com>
> Signed-off-by: Shawn Guo <shengchao.guo@oss.qualcomm.com>
> ---
> drivers/pmdomain/qcom/rpmhpd.c | 35 ++++++++++++++++++++++++++++++++++
> 1 file changed, 35 insertions(+)
>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
--
With best wishes
Dmitry
* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Hanabishi @ 2026-04-14 18:25 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Eric Naim, Thomas Gleixner, LKML, Calvin Owens, Peter Zijlstra,
Anna-Maria Behnsen, Ingo Molnar, John Stultz, Stephen Boyd,
Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
Sebastian Reichel, linux-pm, Pablo Neira Ayuso, Florian Westphal,
Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <ad6BtKRj1GyreNCS@localhost.localdomain>
On 14/04/2026 18:04, Frederic Weisbecker wrote:
> Can you try the following?
>
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index b4d730604972..5c6dfd6bed28 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -100,6 +100,7 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
> /* The clockevent device is getting replaced. Shut it down. */
>
> case CLOCK_EVT_STATE_SHUTDOWN:
> + dev->next_event_forced = 0;
> if (dev->set_state_shutdown)
> return dev->set_state_shutdown(dev);
> return 0;
> @@ -127,10 +128,12 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
> clockevent_get_state(dev)))
> return -EINVAL;
>
> - if (dev->set_state_oneshot_stopped)
> + if (dev->set_state_oneshot_stopped) {
> + dev->next_event_forced = 0;
> return dev->set_state_oneshot_stopped(dev);
> - else
> + } else {
> return -ENOSYS;
> + }
>
> default:
> return -ENOSYS;
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index 7e57fa31ee26..115e0bf01276 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -108,6 +108,7 @@ static struct clock_event_device *tick_get_oneshot_wakeup_device(int cpu)
>
> static void tick_oneshot_wakeup_handler(struct clock_event_device *wd)
> {
> + wd->next_event_forced = 0;
> /*
> * If we woke up early and the tick was reprogrammed in the
> * meantime then this may be spurious but harmless.
This patch doesn't help me unfortunately. Thanks.
* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Eric Naim @ 2026-04-14 18:19 UTC (permalink / raw)
To: Calvin Owens
Cc: Hanabishi, Thomas Gleixner, LKML, Peter Zijlstra,
Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar, John Stultz,
Stephen Boyd, Alexander Viro, Christian Brauner, Jan Kara,
linux-fsdevel, Sebastian Reichel, linux-pm, Pablo Neira Ayuso,
Florian Westphal, Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <ad54kGakZkvCoRaT@mozart.vkv.me>
On 4/15/26 1:25 AM, Calvin Owens wrote:
> On Tuesday 04/14 at 15:39 +0000, Eric Naim wrote:
>> On 4/14/26 5:20 AM, Hanabishi wrote:
>>>
>>> Hello.
>>>
>>> Sorry, but this patch as of 7.0 introduced *severe* periodic lockups on my
>>> Ryzen 7700X machine.
>>> I see such messages in the log:
>>>
>>> clocksource: Long readout interval, skipping watchdog check: cs_nsec:
>>> 2897344852 wd_nsec: 2897356996
>>>
>>> Reverting d6e152d905bdb1f32f9d99775e2f453350399a6a for mainline fixes the
>>> issue for me.
>>>
>>
>> Hi maintainers,
>>
>> several users from CachyOS has reported this regression as well. We landed on
>> the same bisection. One of the users that could reproduce this reliably
>> reproduced this just by watching a YouTube video in a browser, and observed
>> freezes and stutters when interacting with the system.
>
> Huh, I can't reproduce this at all across 10+ machines. Can you share
> the Kconfig you're seeing this on?
Right, here it is [1]. CachyOS does carry a lot of downstream patches, but I
made sure to reproduce this on mainline before reporting here.
[1]
https://github.com/CachyOS/linux-cachyos/blob/4224303b6d7a50dd1cc3ffa78864050cc9536eec/linux-cachyos/config
>
> Thanks,
> Calvin
--
Regards,
Eric
* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Frederic Weisbecker @ 2026-04-14 18:04 UTC (permalink / raw)
To: Eric Naim
Cc: Hanabishi, Thomas Gleixner, LKML, Calvin Owens, Peter Zijlstra,
Anna-Maria Behnsen, Ingo Molnar, John Stultz, Stephen Boyd,
Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
Sebastian Reichel, linux-pm, Pablo Neira Ayuso, Florian Westphal,
Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <aeb848aa-404a-40fb-bd41-329644623b1d@cachyos.org>
On Tue, Apr 14, 2026 at 03:39:00PM +0000, Eric Naim wrote:
> On 4/14/26 5:20 AM, Hanabishi wrote:
> >
> > Hello.
> >
> > Sorry, but this patch as of 7.0 introduced *severe* periodic lockups on my
> > Ryzen 7700X machine.
> > I see such messages in the log:
> >
> > clocksource: Long readout interval, skipping watchdog check: cs_nsec:
> > 2897344852 wd_nsec: 2897356996
> >
> > Reverting d6e152d905bdb1f32f9d99775e2f453350399a6a for mainline fixes the
> > issue for me.
> >
>
> Hi maintainers,
>
> several users from CachyOS has reported this regression as well. We landed on
> the same bisection. One of the users that could reproduce this reliably
> reproduced this just by watching a YouTube video in a browser, and observed
> freezes and stutters when interacting with the system.
>
> I had an LLM generate a fix (patch attached), and it fixed the regression for
> that user. Full disclosure: it is written completely by AI, and I am also not
> familiar with this subsystem. I just hope that this patch can be helpful in
> fixing the regression.
>
> Please don't hesitate to tell me off if utilizing AI in this way is not
> helpful, so I can keep this in mind for future contributions.
>
>
> --
> Regards,
> Eric
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 38570998a19b..37b10045572e 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -332,8 +332,10 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
> if (delta > (int64_t)dev->min_delta_ns) {
> delta = min(delta, (int64_t) dev->max_delta_ns);
> clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
> - if (!dev->set_next_event((unsigned long) clc, dev))
> + if (!dev->set_next_event((unsigned long) clc, dev)) {
> + dev->next_event_forced = 0;
> return 0;
> + }
> }
>
> if (dev->next_event_forced)
> diff --git a/kernel/time/tick-oneshot.c b/kernel/time/tick-oneshot.c
> index 7472597f3225..bf411472d4f7 100644
> --- a/kernel/time/tick-oneshot.c
> +++ b/kernel/time/tick-oneshot.c
> @@ -34,6 +34,7 @@ int tick_program_event(ktime_t expires, int force)
> */
> clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED);
> dev->next_event = KTIME_MAX;
> + dev->next_event_forced = 0;
> return 0;
> }
>
> @@ -43,6 +44,7 @@ int tick_program_event(ktime_t expires, int force)
> * before using it.
> */
> clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
> + dev->next_event_forced = 0;
> }
>
> return clockevents_program_event(dev, expires, force);
That diff suggests that dev->next_event_forced is not properly cleared by
a handler or when the device is stopped.
For example it's not cleared when the device is oneshot stopped.
It's also not cleared when the device is detached (though that shouldn't
matter much) and also when the broadcast wakeup thing is used.
Can you try the following?
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index b4d730604972..5c6dfd6bed28 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -100,6 +100,7 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
 		/* The clockevent device is getting replaced. Shut it down. */

 	case CLOCK_EVT_STATE_SHUTDOWN:
+		dev->next_event_forced = 0;
 		if (dev->set_state_shutdown)
 			return dev->set_state_shutdown(dev);
 		return 0;
@@ -127,10 +128,12 @@ static int __clockevents_switch_state(struct clock_event_device *dev,
 			      clockevent_get_state(dev)))
 			return -EINVAL;

-		if (dev->set_state_oneshot_stopped)
+		if (dev->set_state_oneshot_stopped) {
+			dev->next_event_forced = 0;
 			return dev->set_state_oneshot_stopped(dev);
-		else
+		} else {
 			return -ENOSYS;
+		}

 	default:
 		return -ENOSYS;
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 7e57fa31ee26..115e0bf01276 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -108,6 +108,7 @@ static struct clock_event_device *tick_get_oneshot_wakeup_device(int cpu)

 static void tick_oneshot_wakeup_handler(struct clock_event_device *wd)
 {
+	wd->next_event_forced = 0;
 	/*
 	 * If we woke up early and the tick was reprogrammed in the
 	 * meantime then this may be spurious but harmless.
* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 17:32 UTC (permalink / raw)
To: Kairui Song
Cc: Liam.Howlett, akpm, apopple, axelrasmussen, baohua, baolin.wang,
bhe, byungchul, cgroups, chengming.zhou, chrisl, corbet, david,
dev.jain, gourry, hannes, hughd, jannh, joshua.hahnjy, lance.yang,
lenb, linux-doc, linux-kernel, linux-mm, linux-pm,
lorenzo.stoakes, matthew.brost, mhocko, muchun.song, npache,
pavel, peterx, peterz, pfalcato, rafael, rakie.kim,
roman.gushchin, rppt, ryan.roberts, shakeel.butt, shikemeng,
surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed, yuanchu,
zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAKEwX=NrUhUrAFx+8BYJEfaVKpCm-H9JhBzYSrqOQb-NW7QRug@mail.gmail.com>
On Tue, Apr 14, 2026 at 10:23 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> * I still think there's a good chance we can *significantly* close the
> gap overall between a design with virtual swap and a design without.
> It's a bit premature to commit to a vswap-optional route (which to be
> completely honest I'm still not confident is possible to satisfy all
> of our requirements).
And to further note - these benchmarks measure, in effect, purely swap
overhead. In a production environment with a lot of non-swap work, as
long as the gap is close enough I think we would be fine, even for a
hostile case like a fast swapfile backend (I assume SSD swap's
bottleneck will mostly be the IO).
I will stare at your responses to see if there are other benchmarks I
can play with, but it would be very helpful if you can share your full
suite :)
* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Calvin Owens @ 2026-04-14 17:25 UTC (permalink / raw)
To: Eric Naim
Cc: Hanabishi, Thomas Gleixner, LKML, Peter Zijlstra,
Anna-Maria Behnsen, Frederic Weisbecker, Ingo Molnar, John Stultz,
Stephen Boyd, Alexander Viro, Christian Brauner, Jan Kara,
linux-fsdevel, Sebastian Reichel, linux-pm, Pablo Neira Ayuso,
Florian Westphal, Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <aeb848aa-404a-40fb-bd41-329644623b1d@cachyos.org>
On Tuesday 04/14 at 15:39 +0000, Eric Naim wrote:
> On 4/14/26 5:20 AM, Hanabishi wrote:
> >
> > Hello.
> >
> > Sorry, but this patch as of 7.0 introduced *severe* periodic lockups on my
> > Ryzen 7700X machine.
> > I see such messages in the log:
> >
> > clocksource: Long readout interval, skipping watchdog check: cs_nsec:
> > 2897344852 wd_nsec: 2897356996
> >
> > Reverting d6e152d905bdb1f32f9d99775e2f453350399a6a for mainline fixes the
> > issue for me.
> >
>
> Hi maintainers,
>
> several users from CachyOS has reported this regression as well. We landed on
> the same bisection. One of the users that could reproduce this reliably
> reproduced this just by watching a YouTube video in a browser, and observed
> freezes and stutters when interacting with the system.
Huh, I can't reproduce this at all across 10+ machines. Can you share
the Kconfig you're seeing this on?
Thanks,
Calvin
* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 17:23 UTC (permalink / raw)
To: Kairui Song
Cc: Liam.Howlett, akpm, apopple, axelrasmussen, baohua, baolin.wang,
bhe, byungchul, cgroups, chengming.zhou, chrisl, corbet, david,
dev.jain, gourry, hannes, hughd, jannh, joshua.hahnjy, lance.yang,
lenb, linux-doc, linux-kernel, linux-mm, linux-pm,
lorenzo.stoakes, matthew.brost, mhocko, muchun.song, npache,
pavel, peterx, peterz, pfalcato, rafael, rakie.kim,
roman.gushchin, rppt, ryan.roberts, shakeel.butt, shikemeng,
surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed, yuanchu,
zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAKEwX=P4syV38jAVCWq198r2OHXXc=xA-fx1dk6+qYef6yzxWQ@mail.gmail.com>
On Mon, Mar 23, 2026 at 1:05 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Mar 23, 2026 at 12:41 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2026 at 11:33 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Mon, Mar 23, 2026 at 6:09 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > > > This patch series is based on 6.19. There are a couple more
> > > > > swap-related changes in mainline that I would need to coordinate
> > > > > with, but I still want to send this out as an update for the
> > > > > regressions reported by Kairui Song in [15]. It's probably easier
> > > > > to just build this thing rather than dig through that series of
> > > > > emails to get the fix patch :)
> > > > >
> > > > > Changelog:
> > > > > * v4 -> v5:
> > > > > * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
> > > > > * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
> > > > > and use guard(rcu) in vswap_cpu_dead
> > > > > (reported by Peter Zijlstra [17]).
> > > > > * v3 -> v4:
> > > > > * Fix poor swap free batching behavior to alleviate a regression
> > > > > (reported by Kairui Song).
> > > >
> > >
> > > Hi Kairui! Thanks a lot for the testing big boss :) I will focus on
> > > the regression in this patch series - we can talk more about
> > > directions in another thread :)
Hi Kairui,
My apologies if I missed your response, but could you share with me
your full benchmark suite? It would be hugely useful, not just for
this series, but for all swap contributions in the future :) We should
do as much homework ourselves as possible :P
And apologies for the delayed response. I kept having to go back and
forth between investigating the regression and figuring out what was
going on with the build setups (I missed some of the CONFIGs you had
originally), reducing variance on hosts, etc.
I don't have PMEM, so I have only worked with zram backend so far. I
did manage to reproduce the regressions you showed me (albeit at a
much smaller gap on certain metrics than your cited numbers, which I
suspect is due to zram/pmem difference).
There are two benchmarks that I focused on:
1. Usemem - the exact command I ran is: time ./usemem --init-time -O
-y -x -n 1 56G
My host is 32GB, 52 processor(s) / x86_64.
Build      real (s)        vs base   sys (s)         tput (KB/s)          free_ms
baseline   175.6 +/- 3.6   -         121.9 +/- 3.3   391,941 +/- 8,333    6,992 +/- 204
vss_v5     184.0 +/- 3.9   +4.8%     130.5 +/- 3.8   376,192 +/- 8,581    8,297 +/- 247
(I hope the formatting works, but let me know if it looks weird).
2. Memhog: time memhog 48G
My host for this one is 16 GB, 52 processors, x86_64 too.
Build      real (s)       vs base   sys (s)
baseline   80.5 +/- 1.9   -         62.7 +/- 2.0
vss_v5     83.0 +/- 1.8   +3.1%     65.7 +/- 1.8
On both benchmarks, I enabled MGLRU to more closely match the setup you had.
Staring at the run logs (and double-checking against the logs you sent me
to make sure it's not just my system), there are some common patterns
I noticed across these runs:
1. Kswapd is slower on the vswap side, which shifts work towards
direct reclaim, and makes compaction have to run harder (which has a
weird contention through zsmalloc - I can expand further, but this is
not vswap-specific, just exacerbated by slower kswapd).
2. Higher swap readahead (albeit with higher hit rate) - this is more
of an artifact of the fact that zero swap pages are no longer backed
by zram swapfile, which skipped readahead in certain paths. We can
ignore this for now, but worth assessing this for fast swap backends
in general (zero swap pages, zswap, so on and so forth).
I spent some time perf-ing kswapd, and hacked the usemem binary a bit so
that I could perf the free stage of usemem separately. Most of the
vswap-specific overhead lies in the xarray lookups. Some big offenders
off the top of my head:
1. Right now, in the physical swap allocator, whenever we have an
allocated slot in the range we're checking, we check if that slot is
swap-cache-only (i.e. no swap count), and if so we try to free it (if
the swapfile is almost full etc.). This check is cheap if all swap entry
metadata lives in the physical swap layer only, but more expensive when
you have to go through another layer of indirection :)
I fixed that by just taking one bit in the reverse map to track
swap-cache-only state, which eliminates this without extra space
overhead (on top of the existing design).
2. On the free path, in swap_pte_batch(), we check cgroup to make sure
that the range we pass to free_swap_and_cache_nr() belongs to the same
cgroup, which has a per-PTE overhead for going to the vswap layer. We
can make this check once per range instead, to reduce overhead. Even
better - we can skip this check in swap_pte_batch() for the free case,
and defer it to later on, where we already enter the vswap
cluster lock context :)
With a bunch of changes like that, I closed most of the gap:
usemem:
Build        real (s)        vs base   sys (s)         tput (KB/s)          free_ms
baseline     175.6 +/- 3.6   -         121.9 +/- 3.3   391,941 +/- 8,333    6,992 +/- 204
new_opt_v2   179.8 +/- 3.0   +2.4%     126.1 +/- 2.9   382,536 +/- 6,662    7,105 +/- 183
memhog:
Build        real (s)       vs base   sys (s)
baseline     80.5 +/- 1.9   -         62.7 +/- 2.0
new_opt_v2   79.9 +/- 1.7   -0.8%     62.4 +/- 1.7
I would also like to point out that some of this overhead is specific
to the swapfile backend case, which is why we don't see it with zswap
in the stats I included in v5. Zswap does not require this
swap-cache-only dance, because in virtual swap, zswap only needs the
virtual swap slot as the index (on top of much more negligible space
overhead thanks to the zswap tree merging into the vswap cluster, no swap
charging, no double allocation, etc.).
Anyway, there is still a small gap. The next idea that I have is inspired
by the TLB, which caches virtual->physical memory address translations. I
added a per-CPU MRU virtual cluster. The idea is that a lot of consecutive
swap operations operate on the same range of swap entries - merging
these operations of course makes the most sense, but sometimes it's
not convenient to do it. The non-vswap, old design sometimes locks the
physical swap cluster and exposes the swap cluster struct to callers to
pass around, but I would like to avoid that if possible :)
With this change, we close the gap even further - exceeding the
baseline on average in certain cases, but as you can see it's within
noise, so I wouldn't conclude too much from it:
usemem:
Build      real (s)        vs base   sys (s)         tput (KB/s)           free_ms
baseline   175.6 +/- 3.6   -         121.9 +/- 3.3   391,941 +/- 8,333     6,992 +/- 204
cc_v2      176.4 +/- 5.3   +0.4%     123.6 +/- 5.4   390,405 +/- 12,792    6,987 +/- 296
memhog:
Build      real (s)       vs base   sys (s)
baseline   80.5 +/- 1.9   -         62.7 +/- 2.0
cc_v2      79.9 +/- 0.9   -0.8%     62.1 +/- 1.5
The reclaim and compaction stats tell a similar story:
Reclaim / Compaction (usemem)
Metric           baseline                 vss_v5                   new_opt_v2               cc_v2
allocstall       167,787 +/- 10,292       170,532 +/- 15,185       169,782 +/- 9,903        168,635 +/- 13,526
pgsteal_kswapd   6,932,143 +/- 186,411    6,965,962 +/- 288,323    6,968,188 +/- 286,383    7,038,513 +/- 202,696
pgsteal_direct   9,759,350 +/- 480,674    9,978,721 +/- 765,543    9,899,698 +/- 480,781    9,845,668 +/- 544,319
swap_ra          82.9 +/- 22.6            5994.8 +/- 2817.5        4976.8 +/- 1484.2        4718.2 +/- 1510.5
pgmigrate        1,029,901 +/- 428,416    1,687,072 +/- 399,505    1,260,451 +/- 202,603    1,144,560 +/- 490,177
Reclaim / Compaction (memhog)
Metric           baseline                 vss_v5                   new_opt_v2               cc_v2
allocstall       101,245 +/- 6,271        109,320 +/- 12,180       100,207 +/- 11,053       99,223 +/- 9,905
pgsteal_kswapd   8,817,264 +/- 432,519    8,436,548 +/- 265,763    8,728,944 +/- 305,101    8,962,443 +/- 589,012
pgsteal_direct   5,408,046 +/- 394,775    5,932,611 +/- 584,873    5,419,891 +/- 551,226    5,349,352 +/- 601,655
swap_ra          66.5 +/- 22.8            8589.5 +/- 3325.1        8954.5 +/- 2661.9        8703.1 +/- 1746.6
pgmigrate        239,410 +/- 46,014       277,193 +/- 71,487       320,672 +/- 59,488       243,989 +/- 136,129
You can see that the latter versions gradually restore the behaviors
of baseline in terms of reclaim dynamics :)
Some final remarks:
* I still think there's a good chance we can *significantly* close the
gap overall between a design with virtual swap and a design without.
It's a bit premature to commit to a vswap-optional route (which to be
completely honest I'm still not confident is possible to satisfy all
of our requirements).
* Regardless of the direction we take, these are all pitfalls that
will be problematic for virtual swap design, and more generally some
of them will affect any dynamic swap design (which has to go through
some sort of indirection or a dynamic data structure like xarray that
will induce some amount of lookup overhead). I hope my work here can
be useful in this sense too, outside of this specific vswap direction
:)
I will clean things up a bit and send you a v6 for further inspection.
Once again, I'd like to express my gratitude for your engagement and
feedback.
* [PATCH v2] PM: hibernate: keep existing uswsusp swap pin if re-selection fails
From: DaeMyung Kang @ 2026-04-14 16:49 UTC (permalink / raw)
To: Andrew Morton, Rafael J . Wysocki
Cc: Youngjun Park, Kairui Song, Chris Li, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Len Brown, Pavel Machek, linux-mm,
linux-pm, linux-kernel, DaeMyung Kang
In-Reply-To: <20260414143200.1267932-1-charsyam@gmail.com>
Commit 5b2b0c6e4577 ("mm/swap, PM: hibernate: fix swapoff race in
uswsusp by pinning swap device") introduced SWP_HIBERNATION so that
the swap area selected through /dev/snapshot remains protected against
swapoff() for the lifetime of the uswsusp session.
When user space issues SNAPSHOT_SET_SWAP_AREA again,
snapshot_set_swap_area() currently drops the old pin before attempting
to pin the new swap area. If the new selection fails, the ioctl
returns an error and user space is expected to abort the session.
However, preserving the existing pin in that case makes the kernel
side more robust against a failed re-selection, while keeping the
existing userspace-visible behavior unchanged.
Implement this with the existing swap helpers:
- look up the requested swap area first
- treat re-selecting the already pinned area as a no-op
- pin the new area before unpinning the old one
- leave the existing pin in place if the new pin attempt fails
This keeps the hibernation session protected against swapoff() until
/dev/snapshot is closed, even after a failed attempt to switch to a
different swap area.
Suggested-by: Youngjun Park <youngjun.park@lge.com>
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
---
Notes (not part of the commit, stripped by git am):
Changes in v2:
- Drop Fixes: and Cc: stable; reframe as a hardening improvement
rather than a regression fix, per Youngjun's feedback that the
current behavior is intentional and there is no concrete
user-observable harm.
- Drop the new repin_hibernation_swap_type() helper. Rework
snapshot_set_swap_area() in place using the existing find / pin /
unpin helpers as Youngjun suggested; the change now touches only
kernel/power/user.c and adds no new API.
- Update the subject and commit log accordingly.
- Add Suggested-by: trailer.
v1: https://lore.kernel.org/lkml/20260414143200.1267932-1-charsyam@gmail.com/
Baseline
--------
This patch is generated against linux-next at commit 5b2b0c6e4577
("mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap
device"). Mainline does not yet carry that commit, and neither the
helpers used here (find/pin/unpin_hibernation_swap_type) nor the code
site this patch modifies exist there. The base-commit trailer at the
bottom of the mbox records the exact commit.
Testing
-------
The behavior change can be exercised entirely through the
/dev/snapshot ioctl path; no actual hibernation cycle is required.
A targeted assertion test is below; run it as root in a throwaway VM
with two active swap block devices and one non-swap block device
(three arguments).
Run inside a VM on linux-next at 5b2b0c6e4577 with this patch applied:
step1: pinned active swap /dev/vda
step2: swapoff blocked with EBUSY while pin is held
step3: repinned active swap to /dev/vdb
step4: swapoff(/dev/vda) succeeded after repinning away
step5: repinned swap is blocked with EBUSY
step6: bogus SNAPSHOT_SET_SWAP_AREA failed as expected: No such device
step7: swapoff(/dev/vdb) is still blocked with EBUSY
result: pin preserved across failed re-set (hardened behavior)
step8: swapoff succeeded after closing /dev/snapshot
Without the patch, step7 instead reports
swapoff(/dev/vdb) succeeded after failed re-set
because the old pin had been released before the failed pin attempt.
What the assertion test covers:
- SWP_HIBERNATION is enforced against swapoff (step2, step5);
- the success path moves the pin from one active swap to another
(step3, step4, step5);
- a failed re-selection preserves the existing pin (step6, step7);
- the pin lifetime ends on /dev/snapshot close (step8).
What it does not cover:
- the snapshot_open(O_RDONLY) initial resume-device pin path;
- the full suspend-to-disk image create/restore flow;
- concurrent swapoff racing against SNAPSHOT_SET_SWAP_AREA;
- the type == data->swap idempotent branch (not externally
observable since it intentionally skips the bit toggle).
A normal sysfs-based suspend-to-disk cycle continues to work; the
find_hibernation_swap_type() / pin / unpin paths themselves are
unchanged. Build tested with allmodconfig and run-tested with
CONFIG_PROVE_LOCKING=y and CONFIG_KASAN=y. The VM was booted with
oops=panic panic=-1 so any WARN/Oops/BUG would have halted the run;
the full test completed cleanly with no kernel log diagnostics.
Reproducer (C source, for reference only -- not added to the tree):
// SPDX-License-Identifier: GPL-2.0
/*
 * Reproduce / verify the SNAPSHOT_SET_SWAP_AREA pin-lifetime behavior.
 *
 * Run only inside a throwaway VM. The test manipulates swap state and
 * leaves the target swap area disabled on success.
 *
 * Usage:
 *   ./uswsusp_swapoff_repro <active-swap-1> <active-swap-2> <bogus-blk>
 *
 * Exit codes:
 *   0 = expected (hardened) behavior: pin preserved across failed re-set
 *   1 = old behavior: pin dropped on failed re-set
 *   2 = setup error / inconclusive
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/types.h>
#include <linux/suspend_ioctls.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/swap.h>
#include <sys/sysmacros.h>
#include <unistd.h>

static int encode_dev(dev_t dev)
{
	unsigned int major_num = major(dev);
	unsigned int minor_num = minor(dev);

	/* Match new_encode_dev() / new_decode_dev() in the kernel. */
	return (major_num & 0xfff) << 8 |
	       (minor_num & 0xff) |
	       ((minor_num & ~0xff) << 12);
}

static int get_block_dev(const char *path, dev_t *dev)
{
	struct stat st;

	if (stat(path, &st) < 0) {
		fprintf(stderr, "stat(%s): %s\n", path, strerror(errno));
		return -errno;
	}
	if (!S_ISBLK(st.st_mode)) {
		fprintf(stderr, "%s is not a block device\n", path);
		return -EINVAL;
	}
	*dev = st.st_rdev;
	return 0;
}

static int snapshot_set_swap_area(int fd, dev_t dev, long long offset)
{
	struct resume_swap_area area = {
		.offset = offset,
		.dev = encode_dev(dev),
	};

	if (ioctl(fd, SNAPSHOT_SET_SWAP_AREA, &area) < 0)
		return -errno;
	return 0;
}

int main(int argc, char **argv)
{
const char *p1, *p2, *pb;
dev_t d1, d2, db;
int fd, ret;
bool buggy = false;
if (argc != 4) {
fprintf(stderr,
"usage: %s <swap1> <swap2> <bogus>\n", argv[0]);
return 2;
}
if (geteuid() != 0) {
fprintf(stderr, "must run as root\n");
return 2;
}
p1 = argv[1]; p2 = argv[2]; pb = argv[3];
if (get_block_dev(p1, &d1) < 0 ||
get_block_dev(p2, &d2) < 0 ||
get_block_dev(pb, &db) < 0)
return 2;
fd = open("/dev/snapshot", O_WRONLY);
if (fd < 0) {
fprintf(stderr, "open(/dev/snapshot): %s\n", strerror(errno));
return 2;
}
ret = snapshot_set_swap_area(fd, d1, 0);
if (ret < 0) { fprintf(stderr, "step1: %s\n", strerror(-ret)); goto setup_err; }
printf("step1: pinned active swap %s\n", p1);
if (swapoff(p1) == 0) {
fprintf(stderr, "step2: swapoff unexpectedly succeeded\n");
close(fd); return 1;
}
if (errno != EBUSY) {
fprintf(stderr, "step2: expected EBUSY, got %s\n", strerror(errno));
goto setup_err;
}
printf("step2: swapoff blocked with EBUSY while pin is held\n");
ret = snapshot_set_swap_area(fd, d2, 0);
if (ret < 0) { fprintf(stderr, "step3: %s\n", strerror(-ret)); goto setup_err; }
printf("step3: repinned active swap to %s\n", p2);
if (swapoff(p1) < 0) {
fprintf(stderr, "step4: swapoff(%s): %s\n", p1, strerror(errno));
goto setup_err;
}
printf("step4: swapoff(%s) succeeded after repinning away\n", p1);
if (swapoff(p2) == 0) {
fprintf(stderr, "step5: swapoff unexpectedly succeeded\n");
close(fd); return 1;
}
if (errno != EBUSY) {
fprintf(stderr, "step5: expected EBUSY, got %s\n", strerror(errno));
goto setup_err;
}
printf("step5: repinned swap is blocked with EBUSY\n");
ret = snapshot_set_swap_area(fd, db, 0);
if (!ret) {
fprintf(stderr, "step6: bogus unexpectedly succeeded\n");
goto setup_err;
}
printf("step6: bogus SNAPSHOT_SET_SWAP_AREA failed as expected: %s\n",
strerror(-ret));
if (swapoff(p2) == 0) {
printf("step7: swapoff(%s) succeeded after failed re-set\n", p2);
printf("result: pin was dropped on failure (old behavior)\n");
buggy = true;
} else if (errno == EBUSY) {
printf("step7: swapoff(%s) is still blocked with EBUSY\n", p2);
printf("result: pin preserved across failed re-set (hardened behavior)\n");
} else {
fprintf(stderr, "step7: unexpected: %s\n", strerror(errno));
goto setup_err;
}
close(fd);
if (!buggy) {
if (swapoff(p2) < 0) {
fprintf(stderr, "step8: swapoff(%s): %s\n", p2, strerror(errno));
return 2;
}
printf("step8: swapoff succeeded after closing /dev/snapshot\n");
}
printf("note: re-enable with `swapon %s` and `swapon %s`\n", p1, p2);
return buggy ? 1 : 0;
setup_err:
close(fd);
return 2;
}
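For readers checking the encode_dev() arithmetic above, here is a small
standalone sketch of the same bit layout together with its inverse. The
helper names (pack_dev, unpack_major, unpack_minor, round_trips) are made up
for this note and are not part of the reproducer; the device numbers used in
the usage note below are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Same layout as the reproducer's encode_dev(): major in bits 8..19,
 * low 8 minor bits in bits 0..7, remaining minor bits from bit 20 up. */
static uint32_t pack_dev(unsigned int major_num, unsigned int minor_num)
{
	/* The layout only has room for a 12-bit major and a 20-bit minor. */
	assert(major_num < (1u << 12) && minor_num < (1u << 20));

	return (major_num & 0xfff) << 8 |
	       (minor_num & 0xff) |
	       ((minor_num & ~0xffu) << 12);
}

/* Inverse of pack_dev(), mirroring the shape of new_decode_dev(). */
static unsigned int unpack_major(uint32_t dev)
{
	return (dev & 0xfff00) >> 8;
}

static unsigned int unpack_minor(uint32_t dev)
{
	return (dev & 0xff) | ((dev >> 12) & 0xfff00);
}

/* Returns 1 if (major, minor) survives an encode/decode round trip. */
static int round_trips(unsigned int major_num, unsigned int minor_num)
{
	uint32_t dev = pack_dev(major_num, minor_num);

	return unpack_major(dev) == major_num &&
	       unpack_minor(dev) == minor_num;
}
```

As a worked case, a device with major 253 and minor 16 packs to 0xfd10, and
major 8 with minor 256 packs to 0x100800, since the minor bits above the low
byte are shifted past the major field.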
kernel/power/user.c | 35 ++++++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 9 deletions(-)
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 4406f5644a56..e1ab85db2e95 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -218,6 +218,7 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
{
sector_t offset;
dev_t swdev;
+ int type, swap;
if (swsusp_swap_in_use())
return -EPERM;
@@ -239,18 +240,34 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
}
/*
- * Unpin the swap device if a swap area was already
- * set by SNAPSHOT_SET_SWAP_AREA.
+ * User space encodes device types as two-byte values, so we need to
+ * recode them.
*/
- unpin_hibernation_swap_type(data->swap);
+ type = find_hibernation_swap_type(swdev, offset);
+ if (type < 0)
+ return swdev ? -ENODEV : -EINVAL;
- /*
- * User space encodes device types as two-byte values,
- * so we need to recode them
- */
- data->swap = pin_hibernation_swap_type(swdev, offset);
- if (data->swap < 0)
+ if (type == data->swap) {
+ /*
+ * Re-selecting the already pinned swap area is a no-op.
+ * Keep the existing pin and just refresh the cached device id.
+ */
+ data->dev = swdev;
+ return 0;
+ }
+
+ swap = pin_hibernation_swap_type(swdev, offset);
+ if (swap < 0) {
+ /*
+ * Preserve the existing pin on failure. This can happen if the
+ * target swap area disappears before pinning, or via the
+ * defensive -EBUSY path in pin_hibernation_swap_type().
+ */
return swdev ? -ENODEV : -EINVAL;
+ }
+
+ unpin_hibernation_swap_type(data->swap);
+ data->swap = swap;
data->dev = swdev;
return 0;
}
base-commit: 5b2b0c6e457765adbe96fb2d464ff1bcd3d72158
--
2.43.0
^ permalink raw reply related
* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 16:35 UTC (permalink / raw)
To: Kairui Song
Cc: YoungJun Park, Liam.Howlett, akpm, apopple, axelrasmussen, baohua,
baolin.wang, bhe, byungchul, cgroups, chengming.zhou, chrisl,
corbet, david, dev.jain, gourry, hannes, hughd, jannh,
joshua.hahnjy, lance.yang, lenb, linux-doc, linux-kernel,
linux-mm, linux-pm, lorenzo.stoakes, matthew.brost, mhocko,
muchun.song, npache, pavel, peterx, peterz, pfalcato, rafael,
rakie.kim, roman.gushchin, rppt, ryan.roberts, shakeel.butt,
shikemeng, surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed,
yuanchu, zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAMgjq7BO6SLZPfNXDh1F-7RAOqDAfqMQ4PM=qjAq1mCsWyD0LQ@mail.gmail.com>
On Mon, Apr 13, 2026 at 8:29 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Apr 14, 2026 at 11:05 AM YoungJun Park <youngjun.park@lge.com> wrote:
> >
>
> Hi All,
>
> > On Sat, Apr 11, 2026 at 06:40:44PM -0700, Nhat Pham wrote:
> > > > 1. Modularization
> > > >
> > > > You removed CONFIG_* and went with a unified approach. I recall
> > > > you were also considering a module-based structure at some point.
> > > > What are your thoughts on that direction?
> > > >
> > >
> > > The CONFIG-based approach was a huge mess. It makes me not want to
> > > look at the code, and I'm the author :)
> > >
> > > > If we take that approach, we could extend the recent swap ops
> > > > patchset (https://lore.kernel.org/linux-mm/20260302104016.163542-1-bhe@redhat.com/)
> > > > as follows:
> > > > - Make vswap a swap module
> > > > - Have cluster allocation functions reside in swapops
> > > > - Enable vswap through swapon
> > >
> > > Hmmmmm.
> >
> > I think this would be a happy world, but I wonder what others think.
> > Anyway, I'm looking forward to the future direction.
> >
>
> Yeah, I agree with this.
>
> And I do think swapoff of the virtual space itself is also necessary,
> we really need a failsafe, e.g. a clean way to drop the swap
> cache and data, kind of like drop_caches or shrinker fs are
> commonly used.
>
> > > > 2. Flash-friendly swap integration (for my use case)
> > > >
> > > > I've been thinking about the flash-friendly swap concept that
> > > > I mentioned before and recently proposed:
> > > > (https://lore.kernel.org/linux-mm/aZW0voL4MmnMQlaR@yjaykim-PowerEdge-T330/)
> > > >
> > > > One of its core functions requires buffering RAM-swapped pages
> > > > and writing them sequentially at an appropriate time -- not
> > > > immediately, but in proper block-sized units, sequentially.
> > > >
> > > > This means allocated offsets must essentially be virtual, and
> > > > physical offsets need to be managed separately at the actual
> > > > write time.
> > > >
> > > > If we integrate this into the current vswap, we would either
> > > > need vswap itself to handle the sequential writes (bypassing
> > > > the physical device and receiving pages directly), or swapon
> > > > a swap device and have vswap obtain physical offsets from it.
> > > > But since those offsets cannot be used directly (due to
> > > > buffering and sequential write requirements), they become
> > > > virtual too, resulting in:
> > > >
> > > > virtual -> virtual -> physical
> > > >
> > > > This triple indirection is not ideal.
> > > >
> > > > However, if the modularization from point 1 is achieved and
> > > > vswap acts as a swap device itself, then we can cleanly
> > > > establish a:
> > > >
> > > > virtual -> physical
> > >
> > > I read that thread some time ago. Some remarks:
> > >
> > > 1. I think Christoph has a point. Seems like some of your ideas are
> > > broadly applicable to swap in general. Maybe fixing swap infra
> > > generally would make a lot of sense?
> >
> > Broadly speaking, there are two main ideas:
> > 1. Swap I/O buffering (which is also tied to cluster management issues)
> > 2. Deduplication
> >
> > Are you leaning towards the view that these two should be placed in a
> > higher layer?
>
> IMHO the swap infra should be doing less, not more, so we can have
> more flexible design, and different backends can implement their own
> way to manage the data and layer. e.g. Having one backend being
> flash friendly and it can do this without caring or affecting other devices
> or backends.
I think that's what Youngjun already has, unless I misunderstand his
descriptions.
>
> > If it goes into ZSWAP, there would definitely be a clear advantage of
> > seeing dedup benefits across all swap devices. It's a technically
> > interesting area, and I'd like to discuss it in a separate thread if
> > I have more ideas or thoughts.
>
> Just brainstorming... Why don't we just merge these identical pages like
> KSM? Maybe at least zero folios might benefit a lot if we keep them
> mapped as RO instead of recording them in swap, seems better in the
> long term?
That's our preferred approach too. We just didn't manage to get that
to work (yet). :)
^ permalink raw reply
* Re: [PATCH] PM: hibernate: preserve uswsusp swap pin across SNAPSHOT_SET_SWAP_AREA re-set failures
From: YoungJun Park @ 2026-04-14 16:18 UTC (permalink / raw)
To: DaeMyung Kang
Cc: Andrew Morton, Rafael J . Wysocki, Kairui Song, Chris Li,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Len Brown,
Pavel Machek, linux-mm, linux-pm, linux-kernel, stable
In-Reply-To: <20260414143200.1267932-1-charsyam@gmail.com>
On Tue, Apr 14, 2026 at 11:32:00PM +0900, DaeMyung Kang wrote:
Hi Daemyung :)
> Commit 5b2b0c6e4577 ("mm/swap, PM: hibernate: fix swapoff
> race in uswsusp by pinning swap device") introduced
> SWP_HIBERNATION so that the swap device chosen via
> /dev/snapshot is held against swapoff for the entire uswsusp
> session. The intended invariant is: from the first successful
> SNAPSHOT_SET_SWAP_AREA until the /dev/snapshot fd is closed,
> exactly one swap device is pinned.
>
> snapshot_set_swap_area() breaks that invariant on the re-set
> path:
>
> unpin_hibernation_swap_type(data->swap);
> data->swap = pin_hibernation_swap_type(swdev, offset);
> if (data->swap < 0)
> return swdev ? -ENODEV : -EINVAL;
>
> The unpin happens unconditionally before the new pin is
> attempted. If the new pin fails (e.g. user space supplies an
> offset/device that is not an active swap area), the session
> continues with no swap device pinned, reopening exactly the
> swapoff race the original commit was meant to close. A
> subsequent swapoff on the previously selected device now
> succeeds where it would have been blocked with EBUSY.
Hmm... This was actually intentional. The API semantics
are that a second SNAPSHOT_SET_SWAP_AREA abandons the
previous pin. If the new pin fails, the ioctl returns
an error and userspace is responsible for aborting the
session -- proceeding without a pinned device makes no
sense.
The only benefit of preserving the old pin on failure
would be protecting against userspace that ignores the
error. But even in that case, the session has no valid
swap target, so the hibernation image write itself
would fail before swapoff becomes a concern. I think
this is an edge case rather than a bug.
IOW, looking at your test case, I think this part is
userspace's responsibility:
> ret = snapshot_set_swap_area(fd, bogus_dev, 0);
> if (!ret) {
> fprintf(stderr,
> "step6: bogus SNAPSHOT_SET_SWAP_AREA unexpectedly succeeded\n");
> close(fd);
> return 2;
> }
The ioctl has already returned an error here. Userspace
should close /dev/snapshot and stop, not continue and
expect the old pin to still be in place.
(BTW, the tests are well written and easy to follow.
Thanks!)
For this patch to have real value, there should be
something that concretely breaks after the swapoff
succeeds. But since the session has no valid swap
target at that point, is there any actual broken
behavior that follows?
> if (!buggy) {
> if (swapoff(swap_path_2) < 0) {
> fprintf(stderr,
> "step8: swapoff(%s) after close failed: %s\n",
> swap_path_2, strerror(errno));
> return 2;
> }
> printf("step8: swapoff succeeded after closing /dev/snapshot\n");
> }
If you still see concrete value, I would be happy to
take this as an improvement (without Fixes: and
Cc: stable) -- see my suggestion below for a lighter
approach.
> Reordering pin/unpin in the caller cannot fix this
> cleanly. Each of pin_hibernation_swap_type() /
> unpin_hibernation_swap_type() acquires swap_lock
> independently, so any two-call sequence leaves a window
> in which swapoff can observe an inconsistent pin state.
> The same-area re-set case (type == old_type) also cannot
> be expressed with pin+unpin without either toggling the
> bit (racy) or returning EBUSY (a false error).
>
> Introduce repin_hibernation_swap_type(), which performs
> the transition atomically under a single swap_lock
> acquisition:
I understand the intent. If you still see enough value
in preserving the pin on failure, I would suggest a
lighter approach -- see below.
> - unpin_hibernation_swap_type(data->swap);
> -
> - data->swap = pin_hibernation_swap_type(swdev, offset);
> - if (data->swap < 0)
> + swap = repin_hibernation_swap_type(data->swap, swdev,
> + offset);
> + if (swap < 0)
> return swdev ? -ENODEV : -EINVAL;
> + data->swap = swap;
Would it be simpler to use find_hibernation_swap_type()
to look up the new type first, and if it differs from
data->swap, call pin_hibernation_swap_type() on the new
one? If pin succeeds, unpin the old one. If it returns
-EBUSY, just keep the existing pin.
If swapoff sneaks in between find and pin, pin will
simply fail -- I don't think the kernel needs to
guarantee atomicity for that window. This does acquire
swap_lock multiple times, but SNAPSHOT_SET_SWAP_AREA is
an extremely rare operation, so the extra lock
round-trips should be negligible.
Reusing the existing helpers seems preferable to adding
a new function with this much code for a single call
site.
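To make the suggested ordering concrete, here is a rough userspace model of
it. This is not kernel code: the slot table and the model_*() helpers are
invented for illustration and only imitate the find / pin / unpin helpers
named above (no swap_lock, no real swap_info). The point is just the order
of operations: look up first, pin the new area, and only then unpin the
old one.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 4

/* Toy stand-in for the swap table: one "active" and one "pinned" bit. */
struct slot { bool active; bool pinned; };
static struct slot slots[NSLOTS];

static int model_find(int type)
{
	if (type < 0 || type >= NSLOTS || !slots[type].active)
		return -ENODEV;
	return type;
}

static int model_pin(int type)
{
	if (model_find(type) < 0)
		return -ENODEV;
	if (slots[type].pinned)
		return -EBUSY;
	slots[type].pinned = true;
	return type;
}

static void model_unpin(int type)
{
	if (type >= 0 && type < NSLOTS)
		slots[type].pinned = false;
}

/* swapoff is refused while the slot is pinned. */
static int model_swapoff(int type)
{
	if (model_find(type) < 0)
		return -EINVAL;
	if (slots[type].pinned)
		return -EBUSY;
	slots[type].active = false;
	return 0;
}

/* Find first, pin the new slot, then unpin the old one.
 * On any failure *held is left exactly as it was. */
static int model_set_swap_area(int *held, int want)
{
	int type, pinned;

	assert(held != NULL);

	type = model_find(want);
	if (type < 0)
		return type;
	if (type == *held)
		return 0;	/* same area: keep the existing pin */

	pinned = model_pin(type);
	if (pinned < 0)
		return pinned;	/* old pin stays in place */

	model_unpin(*held);
	*held = pinned;
	return 0;
}
```

In this model a failed re-selection returns an error and leaves both the
held index and its pinned bit untouched, so a later model_swapoff() on the
old slot still gets -EBUSY. The window between find and pin is not modelled.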
What do you think?
Thanks again!
Youngjun Park
^ permalink raw reply
* Re: The "clockevents: Prevent timer interrupt starvation" patch causes lockups
From: Eric Naim @ 2026-04-14 15:39 UTC (permalink / raw)
To: Hanabishi, Thomas Gleixner, LKML
Cc: Calvin Owens, Peter Zijlstra, Anna-Maria Behnsen,
Frederic Weisbecker, Ingo Molnar, John Stultz, Stephen Boyd,
Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
Sebastian Reichel, linux-pm, Pablo Neira Ayuso, Florian Westphal,
Phil Sutter, netfilter-devel, coreteam
In-Reply-To: <68d1e9ac-2780-4be3-8ee3-0788062dd3a4@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1103 bytes --]
On 4/14/26 5:20 AM, Hanabishi wrote:
>
> Hello.
>
> Sorry, but this patch as of 7.0 introduced *severe* periodic lockups on my
> Ryzen 7700X machine.
> I see such messages in the log:
>
> clocksource: Long readout interval, skipping watchdog check: cs_nsec:
> 2897344852 wd_nsec: 2897356996
>
> Reverting d6e152d905bdb1f32f9d99775e2f453350399a6a for mainline fixes the
> issue for me.
>
Hi maintainers,
Several users from CachyOS have reported this regression as well. We landed on
the same bisection. One of the users that could reproduce this reliably
reproduced this just by watching a YouTube video in a browser, and observed
freezes and stutters when interacting with the system.
I had an LLM generate a fix (patch attached), and it fixed the regression for
that user. Full disclosure: it is written completely by AI, and I am also not
familiar with this subsystem. I just hope that this patch can be helpful in
fixing the regression.
Please don't hesitate to tell me off if utilizing AI in this way is not
helpful, so I can keep this in mind for future contributions.
--
Regards,
Eric
[-- Attachment #2: ai.patch --]
[-- Type: text/x-patch, Size: 1283 bytes --]
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 38570998a19b..37b10045572e 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -332,8 +332,10 @@ int clockevents_program_event(struct clock_event_device *dev, ktime_t expires,
if (delta > (int64_t)dev->min_delta_ns) {
delta = min(delta, (int64_t) dev->max_delta_ns);
clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
- if (!dev->set_next_event((unsigned long) clc, dev))
+ if (!dev->set_next_event((unsigned long) clc, dev)) {
+ dev->next_event_forced = 0;
return 0;
+ }
}
if (dev->next_event_forced)
diff --git a/kernel/time/tick-oneshot.c b/kernel/time/tick-oneshot.c
index 7472597f3225..bf411472d4f7 100644
--- a/kernel/time/tick-oneshot.c
+++ b/kernel/time/tick-oneshot.c
@@ -34,6 +34,7 @@ int tick_program_event(ktime_t expires, int force)
*/
clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED);
dev->next_event = KTIME_MAX;
+ dev->next_event_forced = 0;
return 0;
}
@@ -43,6 +44,7 @@ int tick_program_event(ktime_t expires, int force)
* before using it.
*/
clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
+ dev->next_event_forced = 0;
}
return clockevents_program_event(dev, expires, force);
^ permalink raw reply related
* Re: [PATCH v2] cpufreq: Fix hotplug-suspend race during reboot
From: Zhongqiu Han @ 2026-04-14 14:44 UTC (permalink / raw)
To: Tianxiang Chen, rafael; +Cc: viresh.kumar, lingyue, linux-pm, linux-kernel
In-Reply-To: <20260408141914.35281-1-nanmu@xiaomi.com>
On 4/8/2026 10:19 PM, Tianxiang Chen wrote:
> During system reboot, cpufreq_suspend() is called via the
> kernel_restart() -> device_shutdown() -> pm_notifier_call_chain()
> path. Unlike the normal system suspend path, the reboot path does not
> call freeze_processes(), so userspace processes and kernel threads
> remain active.
>
> This allows CPU hotplug operations to run concurrently with
> cpufreq_suspend(). The original code has no synchronization with CPU
> hotplug, leading to a race condition where governor_data can be freed
> by the hotplug path while cpufreq_suspend() is still accessing it,
> resulting in a null pointer dereference:
>
> Unable to handle kernel NULL pointer dereference
> Call Trace:
> do_kernel_fault+0x28/0x3c
> cpufreq_suspend+0xdc/0x160
> device_shutdown+0x18/0x200
> kernel_restart+0x40/0x80
> arm64_sys_reboot+0x1b0/0x200
>
> Fix this by adding cpus_read_lock()/cpus_read_unlock() to
> cpufreq_suspend() to block CPU hotplug operations while suspend is in
> progress.
>
> Signed-off-by: Tianxiang Chen <nanmu@xiaomi.com>
> ---
> v2:
> - Update changelog to explicitly mention reboot scenario
> - Add observed crash trace
> ---
> drivers/cpufreq/cpufreq.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 1f794524a1d9..6f1d264c378b 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -1979,6 +1979,7 @@ void cpufreq_suspend(void)
> if (!cpufreq_driver)
> return;
>
> + cpus_read_lock();
> if (!has_target() && !cpufreq_driver->suspend)
> goto suspend;
>
> @@ -1998,6 +1999,7 @@ void cpufreq_suspend(void)
>
> suspend:
> cpufreq_suspended = true;
> + cpus_read_unlock();
> }
>
> /**
Hi Tianxiang,
May I ask whether you tested this with lockdep enabled? Specifically, does
the new cpus_read_lock() → policy->rwsem ordering in cpufreq_suspend()
trigger any lockdep warnings? Thanks
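For context on what such a warning would mean, below is a toy order checker
in plain C. It is only an illustration of the idea of an ordering inversion,
not how lockdep is implemented, and the two lock classes are placeholders.
Whether any existing path really takes the policy lock before the CPU
hotplug lock is exactly the open question; the sketch does not answer it.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative lock classes only. */
enum { LOCK_CPUS, LOCK_POLICY, NLOCKS };

/* before[a][b] is set once a has been held while b was being taken. */
static bool before[NLOCKS][NLOCKS];
static bool held[NLOCKS];

/* Record the acquisition; return false if it contradicts an order
 * that was observed earlier (an "A then B" vs "B then A" inversion). */
static bool toy_acquire(int l)
{
	bool ok = true;

	assert(l >= 0 && l < NLOCKS);

	for (int h = 0; h < NLOCKS; h++) {
		if (!held[h] || h == l)
			continue;
		if (before[l][h])
			ok = false;	/* earlier: l before h; now: h before l */
		before[h][l] = true;
	}
	held[l] = true;
	return ok;
}

static void toy_release(int l)
{
	assert(l >= 0 && l < NLOCKS);
	held[l] = false;
}
```

If one path acquires LOCK_CPUS and then LOCK_POLICY, and another later
acquires LOCK_POLICY and then LOCK_CPUS, the second toy_acquire() in the
latter path returns false. That is the kind of report a lockdep run would
surface if the new ordering conflicts with an existing one.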
--
Thx and BRs,
Zhongqiu Han
^ permalink raw reply
* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Pavlu @ 2026-04-14 14:33 UTC (permalink / raw)
To: chensong_2000
Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe, jikos, mbenes,
pmladek, joe.lawrence, rostedt, mhiramat, mark.rutland,
mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260413080701.180976-1-chensong_2000@189.cn>
On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
> From: Song Chen <chensong_2000@189.cn>
>
> ftrace and livepatch currently have their module load/unload callbacks
> hard-coded in the module loader as direct function calls to
> ftrace_module_enable(), klp_module_coming(), klp_module_going()
> and ftrace_release_mod(). This tight coupling was originally introduced
> to enforce strict call ordering that could not be guaranteed by the
> module notifier chain, which only supported forward traversal. Their
> notifiers were moved in and out back and forth. see [1] and [2].
I'm unclear about what is meant by the notifiers being moved back and
forth. The links point to patches that converted ftrace+klp from using
module notifiers to explicit callbacks due to ordering issues, but this
switch occurred only once. Have there been other attempts to use
notifiers again?
>
> Now that the notifier chain supports reverse traversal via
> blocking_notifier_call_chain_reverse(), the ordering can be enforced
> purely through notifier priority. As a result, the module loader is now
> decoupled from the implementation details of ftrace and livepatch.
> What's more, adding a new subsystem with symmetric setup/teardown ordering
> requirements during module load/unload no longer requires modifying
> kernel/module/main.c; it only needs to register a notifier_block with an
> appropriate priority.
>
> [1]:https://lore.kernel.org/all/
> alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/
> [2]:https://lore.kernel.org/all/
> 20160301030034.GC12120@packer-debian-8-amd64.digitalocean.com/
Nit: Avoid wrapping URLs, as it breaks autolinking and makes the links
harder to copy.
Better links would be:
[1] https://lore.kernel.org/all/1455661953-15838-1-git-send-email-jeyu@redhat.com/
[2] https://lore.kernel.org/all/1458176139-17455-1-git-send-email-jeyu@redhat.com/
The first link is the final version of what landed as commit
7dcd182bec27 ("ftrace/module: remove ftrace module notifier"). The
second is commit 7e545d6eca20 ("livepatch/module: remove livepatch
module notifier").
>
> Signed-off-by: Song Chen <chensong_2000@189.cn>
> ---
> include/linux/module.h | 8 ++++++++
> kernel/livepatch/core.c | 29 ++++++++++++++++++++++++++++-
> kernel/module/main.c | 34 +++++++++++++++-------------------
> kernel/trace/ftrace.c | 38 ++++++++++++++++++++++++++++++++++++++
> 4 files changed, 89 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 14f391b186c6..0bdd56f9defd 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -308,6 +308,14 @@ enum module_state {
> MODULE_STATE_COMING, /* Full formed, running module_init. */
> MODULE_STATE_GOING, /* Going away. */
> MODULE_STATE_UNFORMED, /* Still setting it up. */
> + MODULE_STATE_FORMED,
I don't see a reason to add a new module state. Why is it necessary and
how does it fit with the existing states?
> +};
> +
> +enum module_notifier_prio {
> + MODULE_NOTIFIER_PRIO_LOW = INT_MIN, /* Low prioroty, coming last, going first */
> + MODULE_NOTIFIER_PRIO_MID = 0, /* Normal priority. */
> + MODULE_NOTIFIER_PRIO_SECOND_HIGH = INT_MAX - 1, /* Second high priorigy, coming second*/
> + MODULE_NOTIFIER_PRIO_HIGH = INT_MAX, /* High priorigy, coming first, going late. */
I suggest being explicit about how the notifiers are ordered. For
example:
enum module_notifier_prio {
MODULE_NOTIFIER_PRIO_NORMAL, /* Normal priority, coming last, going first. */
MODULE_NOTIFIER_PRIO_LIVEPATCH,
MODULE_NOTIFIER_PRIO_FTRACE, /* High priority, coming first, going late. */
};
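To spell out the ordering this implies, here is a small userspace model. It
is a sketch with invented names (toy_nb, toy_register, toy_call_chain), not
the kernel notifier API. It assumes what the patch description states: the
chain is kept sorted by descending priority, the "coming" walk runs head to
tail and the "going" walk runs tail to head.

```c
#include <assert.h>
#include <stddef.h>

enum toy_prio {
	TOY_PRIO_NORMAL,	/* coming last, going first */
	TOY_PRIO_LIVEPATCH,
	TOY_PRIO_FTRACE,	/* coming first, going last */
};

struct toy_nb {
	enum toy_prio prio;
	char tag;		/* one-letter label recorded when "called" */
};

#define TOY_MAX 8
static struct toy_nb *chain[TOY_MAX];
static size_t chain_len;

/* Insert so that the chain stays sorted by descending priority;
 * entries of equal priority keep their registration order. */
static void toy_register(struct toy_nb *nb)
{
	size_t i = chain_len;

	assert(nb != NULL && chain_len < TOY_MAX);

	while (i > 0 && chain[i - 1]->prio < nb->prio) {
		chain[i] = chain[i - 1];
		i--;
	}
	chain[i] = nb;
	chain_len++;
}

/* Record the visiting order as a NUL-terminated string of tags.
 * reverse == 0 walks head to tail, otherwise tail to head.
 * out must have room for chain_len + 1 bytes. */
static void toy_call_chain(int reverse, char *out)
{
	for (size_t i = 0; i < chain_len; i++)
		out[i] = chain[reverse ? chain_len - 1 - i : i]->tag;
	out[chain_len] = '\0';
}
```

With one entry per priority registered in any order, the forward walk visits
ftrace, livepatch, then normal, and the reverse walk visits them in the
opposite order, which matches the sequence of the direct calls that the
patch removes.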
> };
>
> struct mod_tree_node {
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index 28d15ba58a26..ce78bb23e24b 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -1375,13 +1375,40 @@ void *klp_find_section_by_name(const struct module *mod, const char *name,
> }
> EXPORT_SYMBOL_GPL(klp_find_section_by_name);
>
> +static int klp_module_callback(struct notifier_block *nb, unsigned long op,
> + void *module)
> +{
> + struct module *mod = module;
> + int err = 0;
> +
> + switch (op) {
> + case MODULE_STATE_COMING:
> + err = klp_module_coming(mod);
> + break;
> + case MODULE_STATE_LIVE:
> + break;
> + case MODULE_STATE_GOING:
> + klp_module_going(mod);
> + break;
> + default:
> + break;
> + }
klp_module_coming() and klp_module_going() are now used only in
kernel/livepatch/core.c where they are also defined. This means the
functions can be static and their declarations removed from
include/linux/livepatch.h.
Nit: The MODULE_STATE_LIVE and default cases in the switch can be
removed.
> +
> + return notifier_from_errno(err);
> +}
> +
> +static struct notifier_block klp_module_nb = {
> + .notifier_call = klp_module_callback,
> + .priority = MODULE_NOTIFIER_PRIO_SECOND_HIGH
> +};
> +
> static int __init klp_init(void)
> {
> klp_root_kobj = kobject_create_and_add("livepatch", kernel_kobj);
> if (!klp_root_kobj)
> return -ENOMEM;
>
> - return 0;
> + return register_module_notifier(&klp_module_nb);
> }
>
> module_init(klp_init);
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index c3ce106c70af..226dd5b80997 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -833,10 +833,8 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
> /* Final destruction now no one is using it. */
> if (mod->exit != NULL)
> mod->exit();
> - blocking_notifier_call_chain(&module_notify_list,
> + blocking_notifier_call_chain_reverse(&module_notify_list,
> MODULE_STATE_GOING, mod);
> - klp_module_going(mod);
> - ftrace_release_mod(mod);
>
> async_synchronize_full();
>
> @@ -3135,10 +3133,8 @@ static noinline int do_init_module(struct module *mod)
> mod->state = MODULE_STATE_GOING;
> synchronize_rcu();
> module_put(mod);
> - blocking_notifier_call_chain(&module_notify_list,
> + blocking_notifier_call_chain_reverse(&module_notify_list,
> MODULE_STATE_GOING, mod);
> - klp_module_going(mod);
> - ftrace_release_mod(mod);
> free_module(mod);
> wake_up_all(&module_wq);
>
The patch unexpectedly leaves a call to ftrace_free_mem() in
do_init_module().
> @@ -3281,20 +3277,14 @@ static int complete_formation(struct module *mod, struct load_info *info)
> return err;
> }
>
> -static int prepare_coming_module(struct module *mod)
> +static int prepare_module_state_transaction(struct module *mod,
> + unsigned long val_up, unsigned long val_down)
> {
> int err;
>
> - ftrace_module_enable(mod);
> - err = klp_module_coming(mod);
> - if (err)
> - return err;
> -
> err = blocking_notifier_call_chain_robust(&module_notify_list,
> - MODULE_STATE_COMING, MODULE_STATE_GOING, mod);
> + val_up, val_down, mod);
> err = notifier_to_errno(err);
> - if (err)
> - klp_module_going(mod);
>
> return err;
> }
> @@ -3468,14 +3458,21 @@ static int load_module(struct load_info *info, const char __user *uargs,
> init_build_id(mod, info);
>
> /* Ftrace init must be called in the MODULE_STATE_UNFORMED state */
> - ftrace_module_init(mod);
> + err = prepare_module_state_transaction(mod,
> + MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
I believe val_down should be MODULE_STATE_GOING to reverse the
operation. Why is the new state MODULE_STATE_FORMED needed here?
> + if (err)
> + goto ddebug_cleanup;
>
> /* Finally it's fully formed, ready to start executing. */
> err = complete_formation(mod, info);
> - if (err)
> + if (err) {
> + blocking_notifier_call_chain_reverse(&module_notify_list,
> + MODULE_STATE_FORMED, mod);
> goto ddebug_cleanup;
> + }
>
> - err = prepare_coming_module(mod);
> + err = prepare_module_state_transaction(mod,
> + MODULE_STATE_COMING, MODULE_STATE_GOING);
> if (err)
> goto bug_cleanup;
>
> @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
> destroy_params(mod->kp, mod->num_kp);
> blocking_notifier_call_chain(&module_notify_list,
> MODULE_STATE_GOING, mod);
My understanding is that all notifier chains for MODULE_STATE_GOING
should be reversed.
> - klp_module_going(mod);
> bug_cleanup:
> mod->state = MODULE_STATE_GOING;
> /* module_bug_cleanup needs module_mutex protection */
The patch removes the klp_module_going() cleanup call in load_module().
Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
should be removed and appropriately replaced with a cleanup via
a notifier.
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 8df69e702706..efedb98d3db4 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -5241,6 +5241,44 @@ static int __init ftrace_mod_cmd_init(void)
> }
> core_initcall(ftrace_mod_cmd_init);
>
> +static int ftrace_module_callback(struct notifier_block *nb, unsigned long op,
> + void *module)
> +{
> + struct module *mod = module;
> +
> + switch (op) {
> + case MODULE_STATE_UNFORMED:
> + ftrace_module_init(mod);
> + break;
> + case MODULE_STATE_COMING:
> + ftrace_module_enable(mod);
> + break;
> + case MODULE_STATE_LIVE:
> + ftrace_free_mem(mod, mod->mem[MOD_INIT_TEXT].base,
> + mod->mem[MOD_INIT_TEXT].base + mod->mem[MOD_INIT_TEXT].size);
> + break;
> + case MODULE_STATE_GOING:
> + case MODULE_STATE_FORMED:
> + ftrace_release_mod(mod);
> + break;
> + default:
> + break;
> + }
ftrace_module_init(), ftrace_module_enable(), ftrace_free_mem() and
ftrace_release_mod() should now be used only in kernel/trace/ftrace.c
where they are also defined. The functions can then be made static and
removed from include/linux/ftrace.h.
Nit: The default case in the switch can be removed.
> +
> + return notifier_from_errno(0);
Nit: This can be simply "return NOTIFY_OK;".
> +}
> +
> +static struct notifier_block ftrace_module_nb = {
> + .notifier_call = ftrace_module_callback,
> + .priority = MODULE_NOTIFIER_PRIO_HIGH
> +};
> +
> +static int __init ftrace_register_module_notifier(void)
> +{
> + return register_module_notifier(&ftrace_module_nb);
> +}
> +core_initcall(ftrace_register_module_notifier);
> +
> static void function_trace_probe_call(unsigned long ip, unsigned long parent_ip,
> struct ftrace_ops *op, struct ftrace_regs *fregs)
> {
--
Thanks,
Petr
^ permalink raw reply
* [PATCH] PM: hibernate: preserve uswsusp swap pin across SNAPSHOT_SET_SWAP_AREA re-set failures
From: DaeMyung Kang @ 2026-04-14 14:32 UTC (permalink / raw)
To: Andrew Morton, Rafael J . Wysocki
Cc: Youngjun Park, Kairui Song, Chris Li, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Len Brown, Pavel Machek, linux-mm,
linux-pm, linux-kernel, DaeMyung Kang, stable
Commit 5b2b0c6e4577 ("mm/swap, PM: hibernate: fix swapoff race in uswsusp
by pinning swap device") introduced SWP_HIBERNATION so that the swap
device chosen via /dev/snapshot is held against swapoff for the entire
uswsusp session. The intended invariant is: from the first successful
SNAPSHOT_SET_SWAP_AREA until the /dev/snapshot fd is closed, exactly one
swap device is pinned.
snapshot_set_swap_area() breaks that invariant on the re-set path:
unpin_hibernation_swap_type(data->swap);
data->swap = pin_hibernation_swap_type(swdev, offset);
if (data->swap < 0)
return swdev ? -ENODEV : -EINVAL;
The unpin happens unconditionally before the new pin is attempted. If
the new pin fails (e.g. user space supplies an offset/device that is not
an active swap area), the session continues with no swap device pinned,
reopening exactly the swapoff race the original commit was meant to
close. A subsequent swapoff on the previously selected device now
succeeds where it would have been blocked with EBUSY.
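The failure mode can be modelled in user space with a few lines. This is
only a sketch: the pinned[] array stands in for SWP_HIBERNATION, the
model_*() and buggy_set_swap_area() names are hypothetical, and type 7 is
an arbitrary stand-in for a device that is not an active swap area.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_SWAP 2              /* hypothetical: two active swap areas, types 0 and 1 */

static bool pinned[NR_SWAP];   /* stands in for SWP_HIBERNATION per device */

/* Model of the pin helper: fails for anything that is not an active swap. */
static int model_pin(int type)
{
	if (type < 0 || type >= NR_SWAP)
		return -ENODEV;
	pinned[type] = true;
	return type;
}

/* Model of the unpin helper: silently ignores invalid types. */
static void model_unpin(int type)
{
	if (type >= 0 && type < NR_SWAP)
		pinned[type] = false;
}

/* The re-set path as currently written: unpin first, then try to pin. */
static int buggy_set_swap_area(int *session_swap, int new_type)
{
	model_unpin(*session_swap);
	*session_swap = model_pin(new_type);
	return *session_swap < 0 ? -ENODEV : 0;
}

/* True if type 0 is still pinned after a valid set followed by a bogus one. */
static bool pin_survives_failed_reset(void)
{
	int swap = -1;
	int first, second;

	pinned[0] = pinned[1] = false;
	first = buggy_set_swap_area(&swap, 0);    /* valid: pins type 0 */
	second = buggy_set_swap_area(&swap, 7);   /* bogus type: fails */
	assert(first == 0 && second == -ENODEV);
	return pinned[0];
}
```

In this model pin_survives_failed_reset() returns false: the second call
fails, yet the pin taken by the first call is already gone.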
As a secondary consequence, data->swap is overwritten with the negative
error return from pin_hibernation_swap_type(). The value is harmless at
close time (swap_type_to_info() on the invalid type returns NULL, so the
release-side unpin is a no-op and there is no pin to leak), but leaving
a negative sentinel in data->swap for the rest of the session is still
a state-hygiene defect: any future reader of data->swap cannot
distinguish it from a never-set session.
The bug is observable with ioctls alone; it does not require an actual
hibernation cycle. A user-space caller that supplies one valid and then
one invalid resume_swap_area is enough to strand the session without a
pin.
Reordering pin/unpin in the caller cannot fix this cleanly. Each of
pin_hibernation_swap_type() / unpin_hibernation_swap_type() acquires
swap_lock independently, so any two-call sequence leaves a window in
which swapoff can observe an inconsistent pin state. The same-area
re-set case (type == old_type) also cannot be expressed with pin+unpin
without either toggling the bit (racy) or returning EBUSY (a false
error).
Introduce repin_hibernation_swap_type(), which performs the transition
atomically under a single swap_lock acquisition:
- verify that old_type, if held, still carries SWP_HIBERNATION;
- look up the new swap area;
- if it is the same as old_type, return without touching any flags;
- otherwise clear SWP_HIBERNATION on the old si and set it on the
new si within the same critical section;
- on any failure, return without modifying either si's flags, so the
previous pin is preserved.
Update snapshot_set_swap_area() to use the new helper and to stage the
result in a local variable, committing to data->swap only on success.
This closes the protection-loss window and also avoids the data->swap
corruption on failure.
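For illustration, the transition and the caller-side staging can be
sketched as a single-threaded user-space model. The names model_repin()
and model_set_swap_area() are hypothetical, pinned[] stands in for
SWP_HIBERNATION, and the model leaves out the locking: in the kernel the
whole body of the helper runs under one swap_lock acquisition. The error
mapping in the caller is also simplified.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_SWAP 2              /* hypothetical number of active swap areas */

static bool pinned[NR_SWAP];   /* stands in for SWP_HIBERNATION */

/* Model of the repin transition; no flag is touched on any failure path. */
static int model_repin(int old_type, int new_type)
{
	if (old_type >= 0 && (old_type >= NR_SWAP || !pinned[old_type]))
		return -EINVAL;         /* caller claims a pin it does not hold */
	if (new_type < 0 || new_type >= NR_SWAP)
		return -ENODEV;         /* lookup failed: nothing modified */
	if (new_type == old_type)
		return new_type;        /* same area: keep the pin as is */
	if (pinned[new_type])
		return -EBUSY;          /* target is already pinned */

	if (old_type >= 0)
		pinned[old_type] = false;
	pinned[new_type] = true;
	return new_type;
}

/* Caller side: stage the result, commit to the session only on success. */
static int model_set_swap_area(int *session_swap, int new_type)
{
	int swap = model_repin(*session_swap, new_type);

	if (swap < 0)
		return swap;
	*session_swap = swap;
	return 0;
}
```

With this shape a failed re-set leaves both the pin and the session's
recorded swap type exactly as they were before the call.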
Fixes: 5b2b0c6e4577 ("mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device")
Cc: stable@vger.kernel.org
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
---
Notes (not part of the commit, stripped by git am):
Baseline
--------
This patch is generated against linux-next at commit 5b2b0c6e4577
("mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap
device"). Mainline does not yet carry that commit, and neither the
helpers it introduces (pin/unpin_hibernation_swap_type) nor the code
site this patch modifies exist there. The base-commit trailer at the
bottom of the mbox records the exact commit.
Testing
-------
The bug does not require an actual hibernation cycle. The ioctl path
alone is enough to re-open the swapoff race. A targeted reproducer is
included below; run it as root in a throwaway VM with two active swap
block devices and one non-swap block device (three arguments).
Run inside a VM on linux-next at 5b2b0c6e4577 with this patch applied:
step1: pinned active swap /dev/vda
step2: swapoff blocked with EBUSY while pin is held
step3: repinned active swap to /dev/vdb
step4: swapoff(/dev/vda) succeeded after repinning away
step5: repinned swap is blocked with EBUSY
step6: bogus SNAPSHOT_SET_SWAP_AREA failed as expected: No such device
step7: swapoff(/dev/vdb) is still blocked with EBUSY
result: FIXED kernel, hibernation pin was preserved
step8: swapoff succeeded after closing /dev/snapshot
Run on the same tree without this patch applied: step7 instead reports
"swapoff(/dev/vdb) succeeded after failed re-set" and the program exits
with status 1 ("BUGGY kernel, hibernation pin was dropped").
What the reproducer covers:
- SWP_HIBERNATION is actually enforced against swapoff (step2, step5);
- the success path of repin_hibernation_swap_type() atomically moves
the pin from one active swap to another (step3, step4, step5);
- the failure path of repin_hibernation_swap_type() preserves the
existing pin (step6, step7);
- the pin lifetime ends on /dev/snapshot close (step8).
What it does not cover:
- snapshot_open(O_RDONLY) initial resume-device pin path;
- the full suspend-to-disk image create/restore flow;
- concurrent swapoff racing against SNAPSHOT_SET_SWAP_AREA;
- the type == old_type idempotent branch (not externally observable).
A normal sysfs-based suspend-to-disk cycle continues to work; the
find_hibernation_swap_type() path is unchanged. Build tested with
allmodconfig and run-tested with CONFIG_PROVE_LOCKING=y and
CONFIG_KASAN=y. The VM was booted with oops=panic panic=-1 so any
WARN/Oops/BUG would have halted the run; the full test completed
cleanly with no kernel log diagnostics, including the three
WARN_ON_ONCE() invariant checks inside repin_hibernation_swap_type().
Reproducer (C source, for reference only -- not added to the tree):
// SPDX-License-Identifier: GPL-2.0
/*
* Reproduce the uswsusp SNAPSHOT_SET_SWAP_AREA pin lifetime regression.
*
* This targets the bug introduced after hibernation swap pinning was added:
* a failed SNAPSHOT_SET_SWAP_AREA() could drop the existing pin, letting a
* subsequent swapoff() succeed while /dev/snapshot was still open.
*
* Run only inside a throwaway VM. The test manipulates swap state and leaves
* the target swap area disabled on success.
*/
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/types.h>
#include <linux/suspend_ioctls.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/swap.h>
#include <sys/sysmacros.h>
#include <unistd.h>
static void print_usage(const char *prog)
{
fprintf(stderr,
"usage: %s <active-swap-dev-1> <active-swap-dev-2> <bogus-block-dev>\n"
" <active-swap-dev-1> must be an active swap block device.\n"
" <active-swap-dev-2> must be a second active swap block device.\n"
" <bogus-block-dev> must be a block device that is not a swap area.\n",
prog);
}
static int encode_dev(dev_t dev)
{
unsigned int major_num = major(dev);
unsigned int minor_num = minor(dev);
/*
* Match the kernel's new_encode_dev() layout; SNAPSHOT_SET_SWAP_AREA
* decodes this with new_decode_dev() on the kernel side.
*/
return (major_num & 0xfff) << 8 |
(minor_num & 0xff) |
((minor_num & ~0xff) << 12);
}
static int get_block_dev(const char *path, dev_t *dev)
{
struct stat st;
if (stat(path, &st) < 0) {
fprintf(stderr, "stat(%s): %s\n", path, strerror(errno));
return -errno;
}
if (!S_ISBLK(st.st_mode)) {
fprintf(stderr, "%s is not a block device\n", path);
return -EINVAL;
}
*dev = st.st_rdev;
return 0;
}
static int snapshot_set_swap_area(int fd, dev_t dev, long long offset)
{
struct resume_swap_area area = {
.offset = offset,
.dev = encode_dev(dev),
};
if (ioctl(fd, SNAPSHOT_SET_SWAP_AREA, &area) < 0)
return -errno;
return 0;
}
int main(int argc, char **argv)
{
const char *swap_path_1, *swap_path_2, *bogus_path;
dev_t swap_dev_1, swap_dev_2, bogus_dev;
int fd, ret;
bool buggy = false;
if (argc != 4) {
print_usage(argv[0]);
return 2;
}
if (geteuid() != 0) {
fprintf(stderr, "must run as root\n");
return 2;
}
swap_path_1 = argv[1];
swap_path_2 = argv[2];
bogus_path = argv[3];
ret = get_block_dev(swap_path_1, &swap_dev_1);
if (ret < 0)
return 2;
ret = get_block_dev(swap_path_2, &swap_dev_2);
if (ret < 0)
return 2;
ret = get_block_dev(bogus_path, &bogus_dev);
if (ret < 0)
return 2;
fd = open("/dev/snapshot", O_WRONLY);
if (fd < 0) {
fprintf(stderr, "open(/dev/snapshot): %s\n", strerror(errno));
return 2;
}
ret = snapshot_set_swap_area(fd, swap_dev_1, 0);
if (ret < 0) {
fprintf(stderr, "step1: valid SNAPSHOT_SET_SWAP_AREA failed: %s\n",
strerror(-ret));
close(fd);
return 2;
}
printf("step1: pinned active swap %s\n", swap_path_1);
if (swapoff(swap_path_1) == 0) {
fprintf(stderr,
"step2: swapoff(%s) unexpectedly succeeded while pinned\n",
swap_path_1);
close(fd);
return 1;
}
if (errno != EBUSY) {
fprintf(stderr,
"step2: swapoff(%s) failed with %s, expected EBUSY\n",
swap_path_1, strerror(errno));
close(fd);
return 2;
}
printf("step2: swapoff blocked with EBUSY while pin is held\n");
ret = snapshot_set_swap_area(fd, swap_dev_2, 0);
if (ret < 0) {
fprintf(stderr,
"step3: second valid SNAPSHOT_SET_SWAP_AREA failed: %s\n",
strerror(-ret));
close(fd);
return 2;
}
printf("step3: repinned active swap to %s\n", swap_path_2);
if (swapoff(swap_path_1) < 0) {
fprintf(stderr,
"step4: swapoff(%s) failed after repin: %s\n",
swap_path_1, strerror(errno));
close(fd);
return 2;
}
printf("step4: swapoff(%s) succeeded after repinning away\n",
swap_path_1);
if (swapoff(swap_path_2) == 0) {
fprintf(stderr,
"step5: swapoff(%s) unexpectedly succeeded while pinned\n",
swap_path_2);
close(fd);
return 1;
}
if (errno != EBUSY) {
fprintf(stderr,
"step5: swapoff(%s) failed with %s, expected EBUSY\n",
swap_path_2, strerror(errno));
close(fd);
return 2;
}
printf("step5: repinned swap is blocked with EBUSY\n");
ret = snapshot_set_swap_area(fd, bogus_dev, 0);
if (!ret) {
fprintf(stderr,
"step6: bogus SNAPSHOT_SET_SWAP_AREA unexpectedly succeeded\n");
close(fd);
return 2;
}
printf("step6: bogus SNAPSHOT_SET_SWAP_AREA failed as expected: %s\n",
strerror(-ret));
if (swapoff(swap_path_2) == 0) {
printf("step7: swapoff(%s) succeeded after failed re-set\n",
swap_path_2);
printf("result: BUGGY kernel, hibernation pin was dropped\n");
buggy = true;
} else if (errno == EBUSY) {
printf("step7: swapoff(%s) is still blocked with EBUSY\n",
swap_path_2);
printf("result: FIXED kernel, hibernation pin was preserved\n");
} else {
fprintf(stderr, "step7: unexpected swapoff(%s) error: %s\n",
swap_path_2, strerror(errno));
close(fd);
return 2;
}
close(fd);
if (!buggy) {
if (swapoff(swap_path_2) < 0) {
fprintf(stderr,
"step8: swapoff(%s) after close failed: %s\n",
swap_path_2, strerror(errno));
return 2;
}
printf("step8: swapoff succeeded after closing /dev/snapshot\n");
}
printf("note: re-enable swap with `swapon %s` and `swapon %s`\n",
swap_path_1, swap_path_2);
return buggy ? 1 : 0;
}
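Side note on encode_dev() in the reproducer above: the layout can be
checked with a round trip on plain integers. The decode side below is
written from memory of new_decode_dev() in include/linux/kdev_t.h, so
treat it as an assumption; the function names are made up for this
sketch.

```c
#include <assert.h>

/* Same bit layout as the reproducer's encode_dev(), on plain integers. */
static unsigned int encode_dev_nums(unsigned int major_num,
				    unsigned int minor_num)
{
	return (major_num & 0xfff) << 8 |
	       (minor_num & 0xff) |
	       ((minor_num & ~0xffu) << 12);
}

/* Inverse mapping, assumed to match the kernel's new_decode_dev(). */
static unsigned int decode_major(unsigned int dev)
{
	return (dev & 0xfff00) >> 8;
}

static unsigned int decode_minor(unsigned int dev)
{
	/* Low 8 bits of the minor sit in bits 0-7, the rest above bit 19. */
	return (dev & 0xff) | ((dev >> 12) & 0xfff00);
}
```

The high minor bits are placed above the 12-bit major field, which is why
minors larger than 255 survive the round trip.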
include/linux/swap.h | 1 +
kernel/power/user.c | 12 +++------
mm/swapfile.c | 61 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 66 insertions(+), 8 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1930f81e6be4..720347ae8ce1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -435,6 +435,7 @@ static inline long get_nr_swap_pages(void)
extern void si_swapinfo(struct sysinfo *);
extern int pin_hibernation_swap_type(dev_t device, sector_t offset);
+extern int repin_hibernation_swap_type(int old_type, dev_t device, sector_t offset);
extern void unpin_hibernation_swap_type(int type);
extern int find_hibernation_swap_type(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 4406f5644a56..869371ad4a5f 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -218,6 +218,7 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
{
sector_t offset;
dev_t swdev;
+ int swap;
if (swsusp_swap_in_use())
return -EPERM;
@@ -238,19 +239,14 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
offset = swap_area.offset;
}
- /*
- * Unpin the swap device if a swap area was already
- * set by SNAPSHOT_SET_SWAP_AREA.
- */
- unpin_hibernation_swap_type(data->swap);
-
/*
* User space encodes device types as two-byte values,
* so we need to recode them
*/
- data->swap = pin_hibernation_swap_type(swdev, offset);
- if (data->swap < 0)
+ swap = repin_hibernation_swap_type(data->swap, swdev, offset);
+ if (swap < 0)
return swdev ? -ENODEV : -EINVAL;
+ data->swap = swap;
data->dev = swdev;
return 0;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c5b459a18f43..4d3b41125e6a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2215,6 +2215,67 @@ int pin_hibernation_swap_type(dev_t device, sector_t offset)
return type;
}
+/**
+ * repin_hibernation_swap_type - Retarget a hibernation pin without dropping it
+ * @old_type: Currently pinned swap type, or a negative value if none is pinned
+ * @device: Block device containing the resume image
+ * @offset: Offset identifying the swap area
+ *
+ * Locate the swap device for @device/@offset and make it the hibernation-pinned
+ * device. If @old_type already refers to the same swap area, the existing pin
+ * is kept. On failure, the previous pin is preserved.
+ *
+ * Return:
+ * >= 0 on success (new swap type).
+ * -EINVAL if @device is invalid.
+ * -ENODEV if the swap device is not found.
+ * -EBUSY if another device is already pinned for hibernation.
+ */
+int repin_hibernation_swap_type(int old_type, dev_t device, sector_t offset)
+{
+ int type;
+ struct swap_info_struct *old_si = NULL, *new_si;
+
+ spin_lock(&swap_lock);
+
+ if (old_type >= 0) {
+ old_si = swap_type_to_info(old_type);
+ if (WARN_ON_ONCE(!old_si || !(old_si->flags & SWP_HIBERNATION))) {
+ spin_unlock(&swap_lock);
+ return -EINVAL;
+ }
+ }
+
+ type = __find_hibernation_swap_type(device, offset);
+ if (type < 0) {
+ spin_unlock(&swap_lock);
+ return type;
+ }
+
+ if (type == old_type) {
+ spin_unlock(&swap_lock);
+ return type;
+ }
+
+ new_si = swap_type_to_info(type);
+ if (WARN_ON_ONCE(!new_si)) {
+ spin_unlock(&swap_lock);
+ return -ENODEV;
+ }
+
+ if (WARN_ON_ONCE(new_si->flags & SWP_HIBERNATION)) {
+ spin_unlock(&swap_lock);
+ return -EBUSY;
+ }
+
+ if (old_si)
+ old_si->flags &= ~SWP_HIBERNATION;
+ new_si->flags |= SWP_HIBERNATION;
+
+ spin_unlock(&swap_lock);
+ return type;
+}
+
/**
* unpin_hibernation_swap_type - Unpin the swap device for hibernation
* @type: Swap type previously returned by pin_hibernation_swap_type()
base-commit: 5b2b0c6e457765adbe96fb2d464ff1bcd3d72158
--
2.43.0
* Re: [patch V2 11/11] alarmtimer: Remove unused interfaces
From: Frederic Weisbecker @ 2026-04-14 14:27 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, John Stultz, Stephen Boyd, Calvin Owens, Anna-Maria Behnsen,
Peter Zijlstra (Intel), Alexander Viro, Christian Brauner,
Jan Kara, linux-fsdevel, Sebastian Reichel, linux-pm,
Pablo Neira Ayuso, Florian Westphal, Phil Sutter, netfilter-devel,
coreteam
In-Reply-To: <20260408114952.670899355@kernel.org>
Le Wed, Apr 08, 2026 at 01:54:33PM +0200, Thomas Gleixner a écrit :
> All alarmtimer users are converted to alarm_start_timer(). Remove the now
> unused interfaces.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: John Stultz <jstultz@google.com>
> Cc: Stephen Boyd <sboyd@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH v2 2/3] pmdomain: core: add support for power-domains-child-ids
From: Ulf Hansson @ 2026-04-14 14:03 UTC (permalink / raw)
To: Kevin Hilman (TI)
Cc: Rob Herring, Geert Uytterhoeven, linux-pm, devicetree,
linux-kernel, arm-scmi, linux-arm-kernel
In-Reply-To: <20260410-topic-lpm-pmdomain-child-ids-v2-2-83396e4b5f8b@baylibre.com>
On Sat, 11 Apr 2026 at 01:44, Kevin Hilman (TI) <khilman@baylibre.com> wrote:
>
> Currently, PM domains can only support hierarchy for simple
> providers (e.g. ones with #power-domain-cells = 0).
>
> Add support for onecell providers as well by adding a new property
> `power-domains-child-ids` to describe the parent/child relationship.
>
> For example, an SCMI PM domain provider has multiple domains, each of
> which might be a child of different parent domains. In this example,
> the parent domains are MAIN_PD and WKUP_PD:
>
> scmi_pds: protocol@11 {
> reg = <0x11>;
> #power-domain-cells = <1>;
> power-domains = <&MAIN_PD>, <&WKUP_PD>;
> power-domains-child-ids = <15>, <19>;
> };
>
> With this example using the new property, SCMI PM domain 15 becomes a
> child domain of MAIN_PD, and SCMI domain 19 becomes a child domain of
> WKUP_PD.
>
> To support this feature, add two new core functions
>
> - of_genpd_add_child_ids()
> - of_genpd_remove_child_ids()
>
> which can be called by pmdomain providers to add/remove child domains
> if they support the new property power-domains-child-ids.
>
> The add function is "all or nothing". If it cannot add all of the
> child domains in the list, it will unwind any additions already made
> and report a failure.
>
> Signed-off-by: Kevin Hilman (TI) <khilman@baylibre.com>
> ---
> drivers/pmdomain/core.c | 166 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/pm_domain.h | 16 ++++++++++++++++
> 2 files changed, 182 insertions(+)
>
> diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
> index 61c2277c9ce3..f978477dd546 100644
> --- a/drivers/pmdomain/core.c
> +++ b/drivers/pmdomain/core.c
> @@ -2909,6 +2909,172 @@ static struct generic_pm_domain *genpd_get_from_provider(
> return genpd;
> }
>
> +/**
> + * of_genpd_add_child_ids() - Parse power-domains-child-ids property
> + * @np: Device node pointer associated with the PM domain provider.
> + * @data: Pointer to the onecell data associated with the PM domain provider.
> + *
> + * Parse the power-domains and power-domains-child-ids properties to establish
> + * parent-child relationships for PM domains. The power-domains property lists
> + * parent domains, and power-domains-child-ids lists which child domain IDs
> + * should be associated with each parent.
> + *
> + * Uses "all or nothing" semantics: either all relationships are established
> + * successfully, or none are (any partially-added relationships are unwound
> + * on error).
> + *
> + * Returns 0 on success, -ENOENT if properties don't exist, or negative error code.
> + */
As I mentioned in my reply to the previous version, returning a
specific error code when the property doesn't exist complicates
handling for the caller. Moreover, we also need to make sure we don't
return the same error code (-ENOENT) for a different error further
down the execution path in of_genpd_add_child_ids(). Otherwise the
caller would treat the error code in the wrong way.
To me, there are two better ways to address this. For both options,
of_genpd_add_child_ids() should return 0 when
"power-domains-child-ids" is missing.
1) Add another helper function that checks if
"power-domains-child-ids" exists. The caller can then use this to
pre-parse the property and decide whether to treat it as an error.
2) As I suggested earlier, let of_genpd_add_child_ids() return the
number of assigned parents/children, while still using the all or
nothing approach, of course.
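To make option 2 a bit more concrete, here is a rough user-space model of
the return convention I have in mind. It is only a sketch: struct
model_node and model_add_child_ids() are hypothetical stand-ins for the
device node and the real function, and the unwind step is reduced to a
comment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a provider node's two properties. */
struct model_node {
	int nr_parents;                 /* entries in "power-domains" */
	const unsigned int *child_ids;  /* NULL if the child-ids property is absent */
	int nr_child_ids;
};

/*
 * Return the number of parent/child pairs assigned, 0 when the property
 * is absent, and a negative errno only for real failures.
 */
static int model_add_child_ids(const struct model_node *np,
			       unsigned int num_domains)
{
	int i;

	if (!np->child_ids)
		return 0;               /* nothing to do is not an error */
	if (np->nr_child_ids != np->nr_parents)
		return -EINVAL;
	for (i = 0; i < np->nr_child_ids; i++) {
		if (np->child_ids[i] >= num_domains)
			return -EINVAL; /* real code would unwind pairs 0..i-1 */
	}
	return np->nr_child_ids;
}
```

The caller can then tell "nothing to do" (0) from "done" (> 0) from
"failed" (< 0) without having to special-case any particular errno.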
Kind regards
Uffe
> +int of_genpd_add_child_ids(struct device_node *np,
> + struct genpd_onecell_data *data)
> +{
> + struct of_phandle_args parent_args;
> + struct generic_pm_domain *parent_genpd, *child_genpd;
> + struct generic_pm_domain **pairs; /* pairs[2*i]=parent, pairs[2*i+1]=child */
> + u32 child_id;
> + int i, ret, count, child_count, added = 0;
> +
> + /* Check if both properties exist */
> + count = of_count_phandle_with_args(np, "power-domains", "#power-domain-cells");
> + if (count <= 0)
> + return -ENOENT;
> +
> + child_count = of_property_count_u32_elems(np, "power-domains-child-ids");
> + if (child_count < 0)
> + return -ENOENT;
> + if (child_count != count)
> + return -EINVAL;
> +
> + /* Allocate tracking array for error unwind (parent/child pairs) */
> + pairs = kmalloc_array(count * 2, sizeof(*pairs), GFP_KERNEL);
> + if (!pairs)
> + return -ENOMEM;
> +
> + for (i = 0; i < count; i++) {
> + ret = of_property_read_u32_index(np, "power-domains-child-ids",
> + i, &child_id);
> + if (ret)
> + goto err_unwind;
> +
> + /* Validate child ID is within bounds */
> + if (child_id >= data->num_domains) {
> + pr_err("Child ID %u out of bounds (max %u) for %pOF\n",
> + child_id, data->num_domains - 1, np);
> + ret = -EINVAL;
> + goto err_unwind;
> + }
> +
> + /* Get the child domain */
> + child_genpd = data->domains[child_id];
> + if (!child_genpd) {
> + pr_err("Child domain %u is NULL for %pOF\n", child_id, np);
> + ret = -EINVAL;
> + goto err_unwind;
> + }
> +
> + ret = of_parse_phandle_with_args(np, "power-domains",
> + "#power-domain-cells", i,
> + &parent_args);
> + if (ret)
> + goto err_unwind;
> +
> + /* Get the parent domain */
> + parent_genpd = genpd_get_from_provider(&parent_args);
> + of_node_put(parent_args.np);
> + if (IS_ERR(parent_genpd)) {
> + pr_err("Failed to get parent domain for %pOF: %ld\n",
> + np, PTR_ERR(parent_genpd));
> + ret = PTR_ERR(parent_genpd);
> + goto err_unwind;
> + }
> +
> + /* Establish parent-child relationship */
> + ret = pm_genpd_add_subdomain(parent_genpd, child_genpd);
> + if (ret) {
> + pr_err("Failed to add child domain %u to parent in %pOF: %d\n",
> + child_id, np, ret);
> + goto err_unwind;
> + }
> +
> + /* Track for potential unwind */
> + pairs[2 * added] = parent_genpd;
> + pairs[2 * added + 1] = child_genpd;
> + added++;
> +
> + pr_debug("Added child domain %u (%s) to parent %s for %pOF\n",
> + child_id, child_genpd->name, parent_genpd->name, np);
> + }
> +
> + kfree(pairs);
> + return 0;
> +
> +err_unwind:
> + /* Reverse all previously established relationships */
> + while (added-- > 0)
> + pm_genpd_remove_subdomain(pairs[2 * added], pairs[2 * added + 1]);
> + kfree(pairs);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(of_genpd_add_child_ids);
> +
> +/**
> + * of_genpd_remove_child_ids() - Remove parent-child PM domain relationships
> + * @np: Device node pointer associated with the PM domain provider.
> + * @data: Pointer to the onecell data associated with the PM domain provider.
> + *
> + * Reverses the effect of of_genpd_add_child_ids() by parsing the same
> + * power-domains and power-domains-child-ids properties and calling
> + * pm_genpd_remove_subdomain() for each established relationship.
> + *
> + * Returns 0 on success, -ENOENT if properties don't exist, or negative error
> + * code on failure.
> + */
> +int of_genpd_remove_child_ids(struct device_node *np,
> + struct genpd_onecell_data *data)
> +{
> + struct of_phandle_args parent_args;
> + struct generic_pm_domain *parent_genpd, *child_genpd;
> + u32 child_id;
> + int i, ret, count, child_count;
> +
> + /* Check if both properties exist */
> + count = of_count_phandle_with_args(np, "power-domains", "#power-domain-cells");
> + if (count <= 0)
> + return -ENOENT;
> +
> + child_count = of_property_count_u32_elems(np, "power-domains-child-ids");
> + if (child_count < 0)
> + return -ENOENT;
> + if (child_count != count)
> + return -EINVAL;
> +
> + for (i = 0; i < count; i++) {
> + if (of_property_read_u32_index(np, "power-domains-child-ids",
> + i, &child_id))
> + continue;
> +
> + if (child_id >= data->num_domains || !data->domains[child_id])
> + continue;
> +
> + ret = of_parse_phandle_with_args(np, "power-domains",
> + "#power-domain-cells", i,
> + &parent_args);
> + if (ret)
> + continue;
> +
> + parent_genpd = genpd_get_from_provider(&parent_args);
> + of_node_put(parent_args.np);
> + if (IS_ERR(parent_genpd))
> + continue;
> +
> + child_genpd = data->domains[child_id];
> + pm_genpd_remove_subdomain(parent_genpd, child_genpd);
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(of_genpd_remove_child_ids);
> +
> /**
> * of_genpd_add_device() - Add a device to an I/O PM domain
> * @genpdspec: OF phandle args to use for look-up PM domain
> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> index f67a2cb7d781..b44615d79af6 100644
> --- a/include/linux/pm_domain.h
> +++ b/include/linux/pm_domain.h
> @@ -465,6 +465,10 @@ struct generic_pm_domain *of_genpd_remove_last(struct device_node *np);
> int of_genpd_parse_idle_states(struct device_node *dn,
> struct genpd_power_state **states, int *n);
> void of_genpd_sync_state(struct device_node *np);
> +int of_genpd_add_child_ids(struct device_node *np,
> + struct genpd_onecell_data *data);
> +int of_genpd_remove_child_ids(struct device_node *np,
> + struct genpd_onecell_data *data);
>
> int genpd_dev_pm_attach(struct device *dev);
> struct device *genpd_dev_pm_attach_by_id(struct device *dev,
> @@ -534,6 +538,18 @@ struct generic_pm_domain *of_genpd_remove_last(struct device_node *np)
> {
> return ERR_PTR(-EOPNOTSUPP);
> }
> +
> +static inline int of_genpd_add_child_ids(struct device_node *np,
> + struct genpd_onecell_data *data)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static inline int of_genpd_remove_child_ids(struct device_node *np,
> + struct genpd_onecell_data *data)
> +{
> + return -EOPNOTSUPP;
> +}
> #endif /* CONFIG_PM_GENERIC_DOMAINS_OF */
>
> #ifdef CONFIG_PM
>
> --
> 2.51.0
>
* Re: [PATCH] tools/power turbostat: Allow execution to continue after perf_l2_init() failure
From: David Arcari @ 2026-04-14 13:48 UTC (permalink / raw)
To: Mi, Dapeng1, Len Brown; +Cc: Linux PM list, Linux Kernel Mailing List
In-Reply-To: <MN0PR11MB61604669C6DF6D1F372108EACD252@MN0PR11MB6160.namprd11.prod.outlook.com>
The kernel is a Fedora build of 7.0.0-0.rc4.
I have two ADL systems that exhibit the same behavior.
# grep . /sys/devices/cpu_core/events/*
/sys/devices/cpu_core/events/branch-instructions:event=0xc4
/sys/devices/cpu_core/events/branch-misses:event=0xc5
/sys/devices/cpu_core/events/bus-cycles:event=0x3c,umask=0x01
/sys/devices/cpu_core/events/cache-misses:event=0x2e,umask=0x41
/sys/devices/cpu_core/events/cache-references:event=0x2e,umask=0x4f
/sys/devices/cpu_core/events/cpu-cycles:event=0x3c
/sys/devices/cpu_core/events/instructions:event=0xc0
/sys/devices/cpu_core/events/mem-loads:event=0xcd,umask=0x1,ldlat=3
/sys/devices/cpu_core/events/mem-loads-aux:event=0x03,umask=0x82
/sys/devices/cpu_core/events/mem-stores:event=0xcd,umask=0x2
/sys/devices/cpu_core/events/ref-cycles:event=0x3c,umask=0x01
/sys/devices/cpu_core/events/slots:event=0x00,umask=0x4
/sys/devices/cpu_core/events/topdown-bad-spec:event=0x00,umask=0x81
/sys/devices/cpu_core/events/topdown-be-bound:event=0x00,umask=0x83
/sys/devices/cpu_core/events/topdown-br-mispredict:event=0x00,umask=0x85
/sys/devices/cpu_core/events/topdown-fe-bound:event=0x00,umask=0x82
/sys/devices/cpu_core/events/topdown-fetch-lat:event=0x00,umask=0x86
/sys/devices/cpu_core/events/topdown-heavy-ops:event=0x00,umask=0x84
/sys/devices/cpu_core/events/topdown-mem-bound:event=0x00,umask=0x87
/sys/devices/cpu_core/events/topdown-retiring:event=0x00,umask=0x80
# grep . /sys/devices/cpu_atom/events/*
/sys/devices/cpu_atom/events/branch-instructions:event=0xc4
/sys/devices/cpu_atom/events/branch-misses:event=0xc5
/sys/devices/cpu_atom/events/bus-cycles:event=0x3c,umask=0x01
/sys/devices/cpu_atom/events/cache-misses:event=0x2e,umask=0x41
/sys/devices/cpu_atom/events/cache-references:event=0x2e,umask=0x4f
/sys/devices/cpu_atom/events/cpu-cycles:event=0x3c
/sys/devices/cpu_atom/events/instructions:event=0xc0
grep: /sys/devices/cpu_atom/events/mem-loads: binary file matches
grep: /sys/devices/cpu_atom/events/mem-stores: binary file matches
/sys/devices/cpu_atom/events/ref-cycles:event=0x3c,umask=0x01
grep: /sys/devices/cpu_atom/events/topdown-bad-spec: binary file matches
grep: /sys/devices/cpu_atom/events/topdown-be-bound: binary file matches
grep: /sys/devices/cpu_atom/events/topdown-fe-bound: binary file matches
grep: /sys/devices/cpu_atom/events/topdown-retiring: binary file matches
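As a sanity check, the cache-misses encoding listed above
(event=0x2e,umask=0x41) lines up with the "config 0x412e" that perf
prints in the -vvv output quoted below. A small sketch of that
arithmetic, assuming the usual layout with the event in config bits 0-7
and the umask in bits 8-15 (the format directory in sysfs would confirm
this); raw_config() is a made-up helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed format for these PMUs: event in config bits 0-7, umask in
 * config bits 8-15. Check /sys/devices/cpu_core/format/ to confirm.
 */
static uint64_t raw_config(uint8_t event, uint8_t umask)
{
	return (uint64_t)event | ((uint64_t)umask << 8);
}
```

So the event encoding itself looks identical on both PMUs; the
difference shows up only when the event is opened on cpu_atom.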
On 4/13/26 9:22 PM, Mi, Dapeng1 wrote:
> It looks strange. I would expect that the cache-misses event doesn't
> need the offcore_rsp field. Could you please run the commands below?
>
> grep . /sys/devices/cpu_core/events/*
> grep . /sys/devices/cpu_atom/events/*
>
> BTW, which kernel did you use? I will find an ADL system and check
> whether it can be reproduced. Thanks.
>
> ------------------------------------------------------------------------
> *From:* David Arcari <darcari@redhat.com>
> *Sent:* Monday, April 13, 2026 7:53 PM
> *To:* Mi, Dapeng1 <dapeng1.mi@intel.com>; Len Brown <lenb@kernel.org>
> *Cc:* Linux PM list <linux-pm@vger.kernel.org>; Linux Kernel Mailing
> List <linux-kernel@vger.kernel.org>
> *Subject:* Re: [PATCH] tools/power turbostat: Allow execution to
> continue after perf_l2_init() failure
>
>
> Here is the -vvv output:
>
> # sudo perf stat -e cache-misses -vvv sleep 1
> Control descriptor is not initialized
> Opening: cache-misses
> ------------------------------------------------------------
> perf_event_attr:
> type 10 (cpu_atom)
> size 144
> unknown term 'offcore_rsp' for pmu 'cpu_atom' (valid terms:
> event,pc,edge,inv,umask,cmask,config,config1,config2,config3,config4,name,period,percore,metric-id,cpu)
> unknown term 'offcore_rsp' for pmu 'cpu_atom' (valid terms:
> event,pc,edge,inv,umask,cmask,config,config1,config2,config3,config4,name,period,percore,metric-id,cpu)
> unknown term 'offcore_rsp' for pmu 'cpu_atom' (valid terms:
> event,pc,edge,inv,umask,cmask,config,config1,config2,config3,config4,name,period,percore,metric-id,cpu)
> unknown term 'offcore_rsp' for pmu 'cpu_atom' (valid terms:
> event,pc,edge,inv,umask,cmask,config,config1,config2,config3,config4,name,period,percore,metric-id,cpu)
> unknown term 'offcore_rsp' for pmu 'cpu_atom' (valid terms:
> event,pc,edge,inv,umask,cmask,config,config1,config2,config3,config4,name,period,percore,metric-id,cpu)
> unknown term 'offcore_rsp' for pmu 'cpu_atom' (valid terms:
> event,pc,edge,inv,umask,cmask,config,config1,config2,config3,config4,name,period,percore,metric-id,cpu)
> unknown term 'ldlat' for pmu 'cpu_atom' (valid terms:
> event,pc,edge,inv,umask,cmask,config,config1,config2,config3,config4,name,period,percore,metric-id,cpu)
> unknown term 'offcore_rsp' for pmu 'cpu_atom' (valid terms:
> event,pc,edge,inv,umask,cmask,config,config1,config2,config3,config4,name,period,percore,metric-id,cpu)
> unknown term 'offcore_rsp' for pmu 'cpu_atom' (valid terms:
> event,pc,edge,inv,umask,cmask,config,config1,config2,config3,config4,name,period,percore,metric-id,cpu)
> config 0x412e (cache-misses)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 9255 cpu -1 group_fd -1 flags 0x8
> sys_perf_event_open failed, error -12
> Warning:
> skipping event cache-misses that kernel failed to open.
> The sys_perf_event_open() syscall failed for event (cache-misses):
> Cannot allocate memory
> "dmesg | grep -i perf" may provide additional information.
>
> Opening: cache-misses
> ------------------------------------------------------------
> perf_event_attr:
> type 4 (cpu_core)
> size 144
> config 0x412e (cache-misses)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 9255 cpu -1 group_fd -1 flags 0x8 = 3
> cache-misses: -1: 11224 321150 321150
> failed to read counter cache-misses
> cache-misses: 11224 321150 321150
>
> Performance counter stats for 'sleep 1':
>
> <not supported> cpu_atom/cache-misses/
>
> 11,224 cpu_core/cache-misses/
>
>
> 1.003110088 seconds time elapsed
>
> 0.000000000 seconds user
> 0.000846000 seconds sys
>
>
> On 4/12/26 8:34 PM, Mi, Dapeng1 wrote:
> > The most likely reason for the "<not supported>" is that the "sleep 1"
> > process runs on a P-core, where "cpu_atom/cache-misses/" can't be
> > supported. That is expected.
> >
> > BTW, you can add the "-vvv" option, which gives more details, e.g.
> > "sudo perf stat -e cache-misses -vvv sleep 1".
> >
> > Thanks.
> >
> > ------------------------------------------------------------------------
> > *From:* Len Brown <lenb@kernel.org>
> > *Sent:* Saturday, April 11, 2026 2:09 AM
> > *To:* Arcari, David <darcari@redhat.com>; Mi, Dapeng1
> <dapeng1.mi@intel.com>
> > *Cc:* Linux PM list <linux-pm@vger.kernel.org>; Linux Kernel Mailing
> > List <linux-kernel@vger.kernel.org>
> > *Subject:* Re: [PATCH] tools/power turbostat: Allow execution to
> > continue after perf_l2_init() failure
> >
> > On Fri, Apr 10, 2026 at 12:06 PM David Arcari <darcari@redhat.com> wrote:
> >
> > > I'm using a Fedora kernel:
> > >
> > > vmlinuz-7.0.0-0.rc4.260320g0e4f8f1a3d08.40.eln155.x86_64
> > >
> > > And turbostat is:
> > >
> > > # turbostat -v
> > > turbostat version 2026.02.14 - Len Brown <lenb@kernel.org>
> > >
> > > >
> > > > You can poke with "perf stat" as well, but this will depend on what
> > > > .json counter list is compiled into
> > > > your version of perf.
> > > >
> > > > probably a first sanity check would be if these commands for the LLC
> > > > and the L2 work:
> > > >
> > > > sudo perf stat -e cache-misses sleep 1
> > > > sudo perf stat -e L2_REQUEST.ALL sleep 1
> > >
> > > # sudo perf stat -e cache-misses sleep 1
> > >
> > > Performance counter stats for 'sleep 1':
> > >
> > > <not supported> cpu_atom/cache-misses/
> >
> > I think this should work. There may be an issue either with
> > the perf utility or the perf kernel support on that system.
> >
> > I'll cc Dapeng. Already the weekend where he is, but maybe he
> > can give us some perf insight next week.
> >
> > thx,
> > -Len
>
^ permalink raw reply
* Re: [PATCH 2/3] pmdomain: core: add support for power-domains-child-ids
From: Ulf Hansson @ 2026-04-14 13:42 UTC (permalink / raw)
To: Kevin Hilman
Cc: Rob Herring, Geert Uytterhoeven, linux-pm, devicetree,
linux-kernel, arm-scmi, linux-arm-kernel
In-Reply-To: <7hqzomqwpv.fsf@baylibre.com>
On Sat, 11 Apr 2026 at 00:25, Kevin Hilman <khilman@baylibre.com> wrote:
>
> Ulf Hansson <ulf.hansson@linaro.org> writes:
>
> > On Fri, 10 Apr 2026 at 02:45, Kevin Hilman <khilman@baylibre.com> wrote:
> >>
> >> Ulf Hansson <ulf.hansson@linaro.org> writes:
> >>
> >> > On Wed, 11 Mar 2026 at 01:19, Kevin Hilman (TI) <khilman@baylibre.com> wrote:
> >> >>
> >> >> Currently, PM domains can only support hierarchy for simple
> >> >> providers (e.g. ones with #power-domain-cells = 0).
> >> >>
> >> >> Add support for onecell providers as well by adding a new property
> >> >> `power-domains-child-ids` to describe the parent/child relationship.
> >> >>
> >> >> For example, an SCMI PM domain provider has multiple domains, each of
> >> >> which might be a child of different parent domains. In this example,
> >> >> the parent domains are MAIN_PD and WKUP_PD:
> >> >>
> >> >> scmi_pds: protocol@11 {
> >> >> reg = <0x11>;
> >> >> #power-domain-cells = <1>;
> >> >> power-domains = <&MAIN_PD>, <&WKUP_PD>;
> >> >> power-domains-child-ids = <15>, <19>;
> >> >> };
> >> >>
> >> >> With this example using the new property, SCMI PM domain 15 becomes a
> >> >> child domain of MAIN_PD, and SCMI domain 19 becomes a child domain of
> >> >> WKUP_PD.
> >> >>
> >> >> To support this feature, add two new core functions
> >> >>
> >> >> - of_genpd_add_child_ids()
> >> >> - of_genpd_remove_child_ids()
> >> >>
> >> >> which can be called by pmdomain providers to add/remove child domains
> >> >> if they support the new property power-domains-child-ids.
> >> >>
> >> >> Signed-off-by: Kevin Hilman (TI) <khilman@baylibre.com>
> >> >
> >> > Thanks for working on this! It certainly is a missing feature!
> >>
> >> You're welcome, thanks for the detailed review.
> >>
> >> >> ---
> >> >> drivers/pmdomain/core.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >> include/linux/pm_domain.h | 16 ++++++++++++++++
> >> >> 2 files changed, 185 insertions(+)
> >> >>
> >> >> diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
> >> >> index 61c2277c9ce3..acb45dd540b7 100644
> >> >> --- a/drivers/pmdomain/core.c
> >> >> +++ b/drivers/pmdomain/core.c
> >> >> @@ -2909,6 +2909,175 @@ static struct generic_pm_domain *genpd_get_from_provider(
> >> >> return genpd;
> >> >> }
> >> >>
> >> >> +/**
> >> >> + * of_genpd_add_child_ids() - Parse power-domains-child-ids property
> >> >> + * @np: Device node pointer associated with the PM domain provider.
> >> >> + * @data: Pointer to the onecell data associated with the PM domain provider.
> >> >> + *
> >> >> + * Parse the power-domains and power-domains-child-ids properties to establish
> >> >> + * parent-child relationships for PM domains. The power-domains property lists
> >> >> + * parent domains, and power-domains-child-ids lists which child domain IDs
> >> >> + * should be associated with each parent.
> >> >> + *
> >> >> + * Returns 0 on success, -ENOENT if properties don't exist, or negative error code.
> >> >
> >> > I think we should avoid returning specific error codes for specific
> >> > errors, simply because it usually becomes messy.
> >> >
> >> > If I understand correctly the intent here is to allow the caller to
> >> > check for -ENOENT and potentially avoid bailing out as it may not
> >> > really be an error, right?
> >>
> >> Right, -ENOENT is not an error of parsing, it's to indicate that there
> >> are no child-ids to be parsed.
> >>
> >> > Perhaps a better option is to return the number of children for whom
> >> > we successfully assigned parents. Hence 0 or a positive value allows
> >> > the caller to understand what happened. More importantly, a negative
> >> > error code then really becomes an error for the caller to consider.
> >>
> >> I explored this a bit, but it gets messy quick. It means we have to
> >> track cases where only some of the children were added as well as when
> >> all children were added. Personally, I think this should be an "all or
> >> nothing" thing. If all the children cannot be parsed/added, then none
> >> of them should be added.
> >>
> >> This also allows the remove to not have to care about how many were
> >> added, and just remove them all, with the additional benefit of not
> >> having to track the state of how many children were successfully added.
> >>
> >
> > I fully agree, it should be all or nothing. Failing with one
> > child/parent should end up with an error code being returned.
> >
> > That said, it still seems to make perfect sense to return the number
> > of children for whom we assigned parents, no?
>
> No, because what will the caller use that number for? If we are
> assuming "all or nothing", what would we use it for (other than a debug print?)
>
> It also makes it a bit confusing what a zero return value means. Does
> that mean success? Or that zero children were added (which would be
> fail.)
>
> I prefer to keep it as is.
In that case, how should we treat the scenario where the device node
lacks a "power-domains-child-ids" property? In some cases it is
probably fine, while in others it may not be.
I guess the caller of of_genpd_add_child_ids() would then need to
pre-parse for the "power-domains-child-ids" property before deciding
to call of_genpd_add_child_ids().
At least, we don't want of_genpd_add_child_ids() to return an error code if
there is no "power-domains-child-ids" in the device node, as that
would just confuse the caller.
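As a rough illustration of that caller-side pre-check, here is a userspace
sketch (not kernel code). Everything prefixed with mock_ is a hypothetical
stand-in: mock_property_present() plays the role of a property-presence
helper such as of_property_present(), and mock_add_child_ids() plays the role
of of_genpd_add_child_ids() with the "0 on success, negative error code
otherwise, all or nothing" convention discussed above. The -ENODEV returned
for a provider that requires the property is only an assumption made for the
example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a device node: just a list of property names. */
struct mock_node {
	const char *const *props;
	size_t nr_props;
};

/* Mock of a property-presence check on a device node. */
static bool mock_property_present(const struct mock_node *np, const char *name)
{
	for (size_t i = 0; i < np->nr_props; i++)
		if (strcmp(np->props[i], name) == 0)
			return true;
	return false;
}

/*
 * Mock of of_genpd_add_child_ids(): all or nothing. Returns 0 when the
 * children were added, or a negative error code (here: child IDs given
 * without any parent domains) and adds nothing.
 */
static int mock_add_child_ids(const struct mock_node *np)
{
	if (!mock_property_present(np, "power-domains"))
		return -EINVAL;
	return 0;
}

/*
 * Provider-side setup: the caller decides whether a missing
 * "power-domains-child-ids" property is acceptable, so the core helper
 * never has to report "property absent" as an error.
 */
static int provider_setup_children(const struct mock_node *np,
				   bool child_ids_required)
{
	assert(np != NULL);

	if (!mock_property_present(np, "power-domains-child-ids"))
		return child_ids_required ? -ENODEV : 0;

	return mock_add_child_ids(np);
}
```

With this split, any negative value coming back from the core helper is a
genuine failure, and the policy for an absent property stays in the provider.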
Kind regards
Uffe
^ permalink raw reply
* Re: [PATCH v2 2/2] riscv: dts: spacemit: Add cpu scaling for K1 SoC
From: Anand Moon @ 2026-04-14 13:25 UTC (permalink / raw)
To: Shuwei Wu
Cc: Rafael J. Wysocki, Viresh Kumar, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Alexandre Ghiti, Yixun Lan, linux-pm, linux-kernel, linux-riscv,
spacemit, devicetree
In-Reply-To: <20260410-shadow-deps-v2-2-4e16b8c0f60e@mailbox.org>
Hi Shuwei,
On Fri, 10 Apr 2026 at 13:30, Shuwei Wu <shuwei.wu@mailbox.org> wrote:
>
> Add Operating Performance Points (OPP) tables and CPU clock properties
> for the two clusters in the SpacemiT K1 SoC.
>
> Also assign the CPU power supply (cpu-supply) for the Banana Pi BPI-F3
> board to fully enable CPU DVFS.
>
> Signed-off-by: Shuwei Wu <shuwei.wu@mailbox.org>
>
> ---
> Changes in v2:
> - Add k1-opp.dtsi with OPP tables for both CPU clusters
> - Assign CPU supplies and include OPP table for Banana Pi BPI-F3
> ---
> arch/riscv/boot/dts/spacemit/k1-bananapi-f3.dts | 35 +++++++-
> arch/riscv/boot/dts/spacemit/k1-opp.dtsi | 105 ++++++++++++++++++++++++
> arch/riscv/boot/dts/spacemit/k1.dtsi | 8 ++
> 3 files changed, 147 insertions(+), 1 deletion(-)
>
> diff --git a/arch/riscv/boot/dts/spacemit/k1-bananapi-f3.dts b/arch/riscv/boot/dts/spacemit/k1-bananapi-f3.dts
> index 444c3b1e6f44..3780593f610d 100644
> --- a/arch/riscv/boot/dts/spacemit/k1-bananapi-f3.dts
> +++ b/arch/riscv/boot/dts/spacemit/k1-bananapi-f3.dts
> @@ -5,6 +5,7 @@
>
> #include "k1.dtsi"
> #include "k1-pinctrl.dtsi"
> +#include "k1-opp.dtsi"
>
> / {
> model = "Banana Pi BPI-F3";
> @@ -86,6 +87,38 @@ &combo_phy {
> status = "okay";
> };
>
> +&cpu_0 {
> + cpu-supply = <&buck1_3v45>;
> +};
> +
> +&cpu_1 {
> + cpu-supply = <&buck1_3v45>;
> +};
> +
> +&cpu_2 {
> + cpu-supply = <&buck1_3v45>;
> +};
> +
> +&cpu_3 {
> + cpu-supply = <&buck1_3v45>;
> +};
> +
> +&cpu_4 {
> + cpu-supply = <&buck1_3v45>;
> +};
> +
> +&cpu_5 {
> + cpu-supply = <&buck1_3v45>;
> +};
> +
> +&cpu_6 {
> + cpu-supply = <&buck1_3v45>;
> +};
> +
> +&cpu_7 {
> + cpu-supply = <&buck1_3v45>;
> +};
> +
> &emmc {
> bus-width = <8>;
> mmc-hs400-1_8v;
> @@ -201,7 +234,7 @@ pmic@41 {
> dldoin2-supply = <&buck5>;
>
> regulators {
> - buck1 {
> + buck1_3v45: buck1 {
> regulator-min-microvolt = <500000>;
> regulator-max-microvolt = <3450000>;
> regulator-ramp-delay = <5000>;
> diff --git a/arch/riscv/boot/dts/spacemit/k1-opp.dtsi b/arch/riscv/boot/dts/spacemit/k1-opp.dtsi
> new file mode 100644
> index 000000000000..768ae390686d
> --- /dev/null
> +++ b/arch/riscv/boot/dts/spacemit/k1-opp.dtsi
> @@ -0,0 +1,105 @@
> +// SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> +
> +/ {
> + cluster0_opp_table: opp-table-cluster0 {
> + compatible = "operating-points-v2";
> + opp-shared;
> +
> + opp-614400000 {
> + opp-hz = /bits/ 64 <614400000>;
> + opp-microvolt = <950000>;
> + clock-latency-ns = <200000>;
> + };
> +
> + opp-819000000 {
> + opp-hz = /bits/ 64 <819000000>;
> + opp-microvolt = <950000>;
> + clock-latency-ns = <200000>;
> + };
> +
> + opp-1000000000 {
> + opp-hz = /bits/ 64 <1000000000>;
> + opp-microvolt = <950000>;
> + clock-latency-ns = <200000>;
> + };
> +
> + opp-1228800000 {
> + opp-hz = /bits/ 64 <1228800000>;
> + opp-microvolt = <950000>;
> + clock-latency-ns = <200000>;
> + };
> +
> + opp-1600000000 {
> + opp-hz = /bits/ 64 <1600000000>;
> + opp-microvolt = <1050000>;
> + clock-latency-ns = <200000>;
> + };
> + };
> +
> + cluster1_opp_table: opp-table-cluster1 {
> + compatible = "operating-points-v2";
> + opp-shared;
> +
> + opp-614400000 {
> + opp-hz = /bits/ 64 <614400000>;
> + opp-microvolt = <950000>;
> + clock-latency-ns = <200000>;
> + };
> +
> + opp-819000000 {
> + opp-hz = /bits/ 64 <819000000>;
> + opp-microvolt = <950000>;
> + clock-latency-ns = <200000>;
> + };
> +
> + opp-1000000000 {
> + opp-hz = /bits/ 64 <1000000000>;
> + opp-microvolt = <950000>;
> + clock-latency-ns = <200000>;
> + };
> +
> + opp-1228800000 {
> + opp-hz = /bits/ 64 <1228800000>;
> + opp-microvolt = <950000>;
> + clock-latency-ns = <200000>;
> + };
> +
> + opp-1600000000 {
> + opp-hz = /bits/ 64 <1600000000>;
> + opp-microvolt = <1050000>;
> + clock-latency-ns = <200000>;
> + };
> + };
> +};
> +
> +&cpu_0 {
> + operating-points-v2 = <&cluster0_opp_table>;
> +};
> +
> +&cpu_1 {
> + operating-points-v2 = <&cluster0_opp_table>;
> +};
> +
> +&cpu_2 {
> + operating-points-v2 = <&cluster0_opp_table>;
> +};
> +
> +&cpu_3 {
> + operating-points-v2 = <&cluster0_opp_table>;
> +};
> +
> +&cpu_4 {
> + operating-points-v2 = <&cluster1_opp_table>;
> +};
> +
> +&cpu_5 {
> + operating-points-v2 = <&cluster1_opp_table>;
> +};
> +
> +&cpu_6 {
> + operating-points-v2 = <&cluster1_opp_table>;
> +};
> +
> +&cpu_7 {
> + operating-points-v2 = <&cluster1_opp_table>;
> +};
> diff --git a/arch/riscv/boot/dts/spacemit/k1.dtsi b/arch/riscv/boot/dts/spacemit/k1.dtsi
> index 529ec68e9c23..bdd109b81730 100644
> --- a/arch/riscv/boot/dts/spacemit/k1.dtsi
> +++ b/arch/riscv/boot/dts/spacemit/k1.dtsi
> @@ -54,6 +54,7 @@ cpu_0: cpu@0 {
> compatible = "spacemit,x60", "riscv";
> device_type = "cpu";
> reg = <0>;
> + clocks = <&syscon_apmu CLK_CPU_C0_CORE>;
> riscv,isa = "rv64imafdcbv_zicbom_zicbop_zicboz_zicntr_zicond_zicsr_zifencei_zihintpause_zihpm_zfh_zba_zbb_zbc_zbs_zkt_zvfh_zvkt_sscofpmf_sstc_svinval_svnapot_svpbmt";
> riscv,isa-base = "rv64i";
> riscv,isa-extensions = "i", "m", "a", "f", "d", "c", "b", "v", "zicbom",
> @@ -84,6 +85,7 @@ cpu_1: cpu@1 {
> compatible = "spacemit,x60", "riscv";
> device_type = "cpu";
> reg = <1>;
> + clocks = <&syscon_apmu CLK_CPU_C0_CORE>;
Based on the Spacemit kernel source, the k1-x_opp_table.dtsi file
defines several additional clocks for the Operating Performance Points
(OPP) table:
clocks = <&ccu CLK_CPU_C0_ACE>, <&ccu CLK_CPU_C1_ACE>, <&ccu CLK_CPU_C0_TCM>,
         <&ccu CLK_CCI550>, <&ccu CLK_PLL3>, <&ccu CLK_CPU_C0_HI>,
         <&ccu CLK_CPU_C1_HI>;
clock-names = "ace0", "ace1", "tcm", "cci", "pll3", "c0hi", "c1hi";
These hardware clocks are also explicitly registered in the APMU clock driver
via the k1_ccu_apmu_hws array, confirming their availability for frequency
and voltage scaling on the K1-X SoC.
static struct clk_hw *k1_ccu_apmu_hws[] = {
[CLK_CCI550] = &cci550_clk.common.hw,
[CLK_CPU_C0_HI] = &cpu_c0_hi_clk.common.hw,
[CLK_CPU_C0_CORE] = &cpu_c0_core_clk.common.hw,
[CLK_CPU_C0_ACE] = &cpu_c0_ace_clk.common.hw,
[CLK_CPU_C0_TCM] = &cpu_c0_tcm_clk.common.hw,
[CLK_CPU_C1_HI] = &cpu_c1_hi_clk.common.hw,
[CLK_CPU_C1_CORE] = &cpu_c1_core_clk.common.hw,
[CLK_CPU_C1_ACE] = &cpu_c1_ace_clk.common.hw,
It should therefore be possible to add these clocks for DVFS to work
correctly, provided they are managed by the appropriate driver and declared
in the Device Tree (DT).
Thanks
-Anand
^ permalink raw reply
* Re: [PATCH v4 00/11] Add support for the TI BQ25792 battery charger
From: Mark Brown @ 2026-04-14 13:07 UTC (permalink / raw)
To: Lee Jones, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Chris Morgan, Liam Girdwood, Sebastian Reichel, Alexey Charkov
Cc: devicetree, linux-kernel, Sebastian Reichel, linux-pm,
Krzysztof Kozlowski, stable
In-Reply-To: <20260311-bq25792-v4-0-7213415d9eec@flipper.net>
On Wed, 11 Mar 2026 15:56:13 +0400, Alexey Charkov wrote:
> This adds support for the TI BQ25792 battery charger, which is similar in
> overall logic to the BQ25703A, but has a different register layout and
> slightly different lower-level programming logic.
>
> The series is organized as follows:
> - Patch 1 adds the new variant to the existing DT binding, including the
> changes in electrical characteristics
> - Patches 2-4 are minor cleanups to the existing BQ25703A OTG regulator
> driver, slimming down the code and making it more reusable for the new
> BQ25792 variant
> - Patch 5 is a logical fix to the BQ25703A clamping logic for VSYSMIN
> (this is a standalone fix which can be applied independently and may be
> backported to stable)
> - Patches 6-8 are slight refactoring of the existing BQ25703A charger
> driver to make it more reusable for the new BQ25792 variant
> - Patch 9 adds platform data to distinguish between the two variants in
> the parent MFD driver, and binds it to the new compatible string
> - Patches 10-11 add variant-specific code to support the new BQ25792
> variant in the regulator part and the charger part respectively,
> selected by the platform data added in patch 9
>
> [...]
Applied to
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator.git for-next
Thanks!
[01/11] dt-bindings: mfd: ti,bq25703a: Expand to include BQ25792
(no commit info)
[02/11] regulator: bq257xx: Remove reference to the parent MFD's dev
https://git.kernel.org/broonie/misc/c/aef4d87f2c1f
[03/11] regulator: bq257xx: Drop the regulator_dev from the driver data
(no commit info)
[04/11] regulator: bq257xx: Make OTG enable GPIO really optional
https://git.kernel.org/broonie/misc/c/de76a763805d
[05/11] power: supply: bq257xx: Fix VSYSMIN clamping logic
(no commit info)
[06/11] power: supply: bq257xx: Make the default current limit a per-chip attribute
(no commit info)
[07/11] power: supply: bq257xx: Consistently use indirect get/set helpers
(no commit info)
[08/11] power: supply: bq257xx: Add fields for 'charging' and 'overvoltage' states
(no commit info)
[09/11] mfd: bq257xx: Add BQ25792 support
(no commit info)
[10/11] regulator: bq257xx: Add support for BQ25792
(no commit info)
[11/11] power: supply: bq257xx: Add support for BQ25792
(no commit info)
All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.
You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.
If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.
Please add any relevant lists and maintainers to the CCs when replying
to this mail.
Thanks,
Mark
^ permalink raw reply
* Re: [PATCH v2 1/8] dt-bindings: thermal: amlogic: Add support for T7
From: Ronald Claveau @ 2026-04-14 12:33 UTC (permalink / raw)
To: Conor Dooley
Cc: Guillaume La Roque, Rafael J. Wysocki, Daniel Lezcano, Zhang Rui,
Lukasz Luba, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Neil Armstrong, Kevin Hilman, Jerome Brunet, Martin Blumenstingl,
linux-pm, linux-amlogic, devicetree, linux-kernel,
linux-arm-kernel
In-Reply-To: <20260413-impose-cartel-bd7d18f91a24@spud>
On 4/13/26 5:42 PM, Conor Dooley wrote:
> On Mon, Apr 13, 2026 at 12:52:42PM +0200, Ronald Claveau wrote:
>> Add the amlogic,t7-thermal compatible for the Amlogic T7 thermal sensor.
>>
>> Unlike existing variants which use a phandle to the ao-secure syscon,
>> the T7 relies on a secure monitor interface described by a phandle and
>> a sensor index argument.
>>
>> The T7 integrates multiple thermal sensors, all accessed through the
>> same SMC call. The sensor index argument is required to identify which
>> sensor's calibration data the secure monitor should return, as a single
>> SM_THERMAL_CALIB_READ command serves all of them.
>>
>> Introduce the amlogic,secure-monitor property as a phandle-array and
>> make amlogic,ao-secure or amlogic,secure-monitor conditionally required
>> depending on the compatible.
>>
>> Signed-off-by: Ronald Claveau <linux-kernel-dev@aliel.fr>
>> ---
>> .../bindings/thermal/amlogic,thermal.yaml | 42 ++++++++++++++++++++--
>> 1 file changed, 40 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/devicetree/bindings/thermal/amlogic,thermal.yaml b/Documentation/devicetree/bindings/thermal/amlogic,thermal.yaml
>> index 70b273271754b..1c096116b2dda 100644
>> --- a/Documentation/devicetree/bindings/thermal/amlogic,thermal.yaml
>> +++ b/Documentation/devicetree/bindings/thermal/amlogic,thermal.yaml
>> @@ -21,7 +21,9 @@ properties:
>> - amlogic,g12a-cpu-thermal
>> - amlogic,g12a-ddr-thermal
>> - const: amlogic,g12a-thermal
>> - - const: amlogic,a1-cpu-thermal
>> + - enum:
>> + - amlogic,a1-cpu-thermal
>> + - amlogic,t7-thermal
>>
>> reg:
>> maxItems: 1
>> @@ -42,12 +44,39 @@ properties:
>> '#thermal-sensor-cells':
>> const: 0
>>
>> + amlogic,secure-monitor:
>> + description: phandle to the secure monitor
>> + $ref: /schemas/types.yaml#/definitions/phandle-array
>> + items:
>> + - items:
>> + - description: phandle to the secure monitor
>> + - description: sensor index to get specific calibration data
>> +
>> required:
>> - compatible
>> - reg
>> - interrupts
>> - clocks
>> - - amlogic,ao-secure
>> +
>> +allOf:
>> + - if:
>> + properties:
>> + compatible:
>> + contains:
>> + enum:
>> + - amlogic,a1-cpu-thermal
>> + - amlogic,g12a-thermal
>> + then:
>> + required:
>> + - amlogic,ao-secure
>> + - if:
>> + properties:
>> + compatible:
>> + contains:
>> + const: amlogic,t7-thermal
>
> This can just be replaced by an else I think.
>
Thank you for your feedback, I will replace this `if` condition with an
`else`.
>> + then:
>> + required:
>> + - amlogic,secure-monitor
>>
>> unevaluatedProperties: false
>>
>> @@ -62,4 +91,13 @@ examples:
>> #thermal-sensor-cells = <0>;
>> amlogic,ao-secure = <&sec_AO>;
>> };
>> + - |
>> + a73_tsensor: temperature-sensor@20000 {
>
> Can drop the label here, it has no users.
>
Ok, I will remove this label.
> Otherwise, seems fine.
>
> Cheers,
> Conor.
>
> pw-bot: changes-requested
>
>> + compatible = "amlogic,t7-thermal";
>> + reg = <0x0 0x20000 0x0 0x50>;
>> + interrupts = <GIC_SPI 31 IRQ_TYPE_LEVEL_HIGH>;
>> + clocks = <&clkc_periphs CLKID_TS>;
>> + #thermal-sensor-cells = <0>;
>> + amlogic,secure-monitor = <&sm 1>;
>> + };
>> ...
>>
>> --
>> 2.49.0
>>
--
Best regards,
Ronald
^ permalink raw reply
* Re: [PATCH v7 0/8] Add support for handling PCIe M.2 Key E connectors in devicetree
From: Andy Shevchenko @ 2026-04-14 12:03 UTC (permalink / raw)
To: Chen-Yu Tsai
Cc: Manivannan Sadhasivam, Manivannan Sadhasivam, Rob Herring,
Greg Kroah-Hartman, Jiri Slaby, Nathan Chancellor, Nicolas Schier,
Hans de Goede, Ilpo Järvinen, Mark Pearson, Derek J. Clark,
Krzysztof Kozlowski, Conor Dooley, Marcel Holtmann,
Luiz Augusto von Dentz, Bartosz Golaszewski, Bartosz Golaszewski,
linux-serial, linux-kernel, linux-kbuild, platform-driver-x86,
linux-pci, devicetree, linux-arm-msm, linux-bluetooth, linux-pm,
Stephan Gerhold, Dmitry Baryshkov, linux-acpi, Hans de Goede,
Bartosz Golaszewski
In-Reply-To: <CAGXv+5EGe59nJctLweEdZjb3MNmMvjuCHngGSfptzN985OiLdg@mail.gmail.com>
On Tue, Apr 14, 2026 at 06:29:02PM +0800, Chen-Yu Tsai wrote:
> On Tue, Apr 14, 2026 at 4:28 PM Andy Shevchenko
> <andriy.shevchenko@linux.intel.com> wrote:
> > On Tue, Apr 14, 2026 at 01:03:19PM +0800, Chen-Yu Tsai wrote:
> > > On Tue, Apr 14, 2026 at 12:08 AM Manivannan Sadhasivam <mani@kernel.org> wrote:
> > > > On Mon, Apr 13, 2026 at 07:33:12PM +0530, Manivannan Sadhasivam wrote:
> > > > > On Mon, Apr 13, 2026 at 03:54:59PM +0800, Chen-Yu Tsai wrote:
> > > > > > On Thu, Mar 26, 2026 at 01:36:28PM +0530, Manivannan Sadhasivam wrote:
...
> > > > > > - Given that this connector actually represents two devices, how do I
> > > > > > say I want the BT part to be a wakeup source, but not the WiFi part?
> > > > > > Does wakeup-source even work at this point?
> > > > >
> > > > > You can't use the DT property since the devices are not described in DT
> > > > > statically. But you can still use the per-device 'wakeup' sysfs knob to enable
> > > > > wakeup.
> > >
> > > I see. I think not being able to specify generic properties for the devices
> > > on the connector is going to be a bit problematic.
> >
> > This is the nature of open connectors, especially on buses that are
> > hotpluggable, like PCIe. We never know what is connected there _ahead_.
>
> I believe what you mean by "hotpluggable" is "user replaceable".
From the OS perspective it's the same. From the platform perspective
there is a difference, granted.
> > In other words you can't describe in DT something that may not exist.
>
> But this is actually doable with the PCIe slot representation. The
> properties are put in the device node for the slot. If no card is
> actually inserted in the slot, then no device is created, and the
> device node is left as not associated with anything.
But you need to list all devices in the world if you want to support this
somehow. Yes, probably many of them (or the majority) will be enumerated as is,
but some may need assistance via (dynamic) properties or similar mechanisms.
> It's just that for this new M.2 E-key connector, there aren't separate
> nodes for each interface. And the system doesn't associate the device
> node with the device, because it's no longer a child node of the
> controller or hierarchy, but connected over the OF graph.
>
> Moving over to the E-key connector representation seems like one step
> forward and one step backward in descriptive ability. We gain proper
> power sequencing, but lose generic properties.
The "key" is a property of the connector. Hence, if you have an idea of what
can be common to ALL keys, that can probably be abstracted. Note, I'm not
familiar with the connector framework in the Linux kernel; perhaps it's already
that kind of abstraction.
> The latter part is solvable, but we likely need child nodes under the
> connector for the different interfaces. Properties that make sense for
> one type might not make sense for another.
>
> P.S. We could also just add child device nodes under the controller to
> put the generic properties, but that's splitting the description into
> multiple parts. Let's not go there if at all possible.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply
* [GIT PULL] pmdomain updates for v7.1
From: Ulf Hansson @ 2026-04-14 10:38 UTC (permalink / raw)
To: Linus, linux-pm, linux-kernel; +Cc: Ulf Hansson, linux-arm-kernel
Hi Linus,
Here's the pull-request with pmdomain updates for v7.1. Details about the
highlights are as usual found in the signed tag.
Please pull this in!
Kind regards
Ulf Hansson
The following changes since commit e91d5f94acf68618ea3ad9c92ac28614e791ae7d:
pmdomain: imx8mp-blk-ctrl: Keep the NOC_HDCP clock enabled (2026-04-01 13:03:07 +0200)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm.git tags/pmdomain-v7.1
for you to fetch changes up to 596ca99cf04f339db2ed18a5bb230ee11a47b699:
pmdomain: qcom: rpmhpd: Add power domains for Hawi SoC (2026-04-08 12:01:37 +0200)
----------------------------------------------------------------
pmdomain core:
- Extend statistics for domain idle states with s2idle data
- Show latency/residency for domain idle states in debugfs
pmdomain providers:
- imx: Add support for optional subnodes for imx93-blk-ctrl
- marvell: Add audio power island for Marvell PXA1908
- mediatek: Add legacy support for the MT7622 audio power domain
- mediatek: Add nvmem provider functionality to the mtk-mfg-pmdomain
- mediatek: Add support for the MT8189 power domains
- qcom: Add support for the Eliza and Hawi power domains
- sunxi: Add support for the Allwinner A733 power domains
- ti: Handle wakeup constraints for out-of-band wakeups for ti_sci
----------------------------------------------------------------
Abel Vesa (2):
dt-bindings: power: qcom,rpmpd: document the Eliza RPMh Power Domains
pmdomain: qcom: rpmhpd: Add Eliza RPMh Power Domains
AngeloGioacchino Del Regno (2):
dt-bindings: power: mt7622-power: Add MT7622_POWER_DOMAIN_AUDIO
pmdomain: mediatek: scpsys: Add MT7622 Audio power domain to legacy driver
Chris Morgan (1):
pmdomain: rockchip: quiet regulator error on -EPROBE_DEFER
Dmitry Baryshkov (1):
PM: domains: De-constify fields in struct dev_pm_domain_attach_data
Felix Gu (2):
pmdomain: ti: omap_prm: Fix a reference leak on device node
pmdomain: imx: scu-pd: Fix device_node reference leak during ->probe()
Fenglin Wu (2):
dt-bindings: power: qcom,rpmhpd: Add RPMh power domain for Hawi SoC
pmdomain: qcom: rpmhpd: Add power domains for Hawi SoC
Gabor Juhos (1):
pmdomain: qcom: rpmpd: drop stray semicolon
Irving-CH Lin (3):
dt-bindings: power: Add MediaTek MT8189 power domain
pmdomain: mediatek: Add bus protect control flow for MT8189
pmdomain: mediatek: Add power domain driver for MT8189 SoC
Karel Balej (2):
dt-bindings: power: define ID for Marvell PXA1908 audio domain
pmdomain: add audio power island for Marvell PXA1908 SoC
Kendall Willis (1):
pmdomain: ti_sci: handle wakeup constraint for out-of-band wakeup
Krzysztof Kozlowski (1):
pmdomain: mediatek: Simplify with scoped for each OF child loop
Marco Felsch (3):
pmdomain: imx93-blk-ctrl: cleanup error path
pmdomain: imx93-blk-ctrl: convert to devm_* only
pmdomain: imx93-blk-ctrl: add support for optional subnodes
Maíra Canal (1):
pmdomain: bcm: bcm2835-power: Replace open-coded polling with readl_poll_timeout_atomic()
Nicolas Frattaroli (2):
dt-bindings: power: mt8196-gpufreq: Describe nvmem provider ability
pmdomain: mediatek: mtk-mfg: Expose shader_present as nvmem cell
Rosen Penev (2):
pmdomain: qcom: cpr: simplify main allocation
pmdomain: qcom: cpr: add COMPILE_TEST support
Ulf Hansson (8):
pmdomain: Merge branch dt into next
pmdomain: core: Restructure domain idle states data for genpd in debugfs
pmdomain: core: Show latency/residency for domain idle states in debugfs
pmdomain: core: Extend statistics for domain idle states with s2idle data
pmdomain: arm: Add print after a successful probe for SCMI power domains
pmdomain: Merge branch pmdomain into next
pmdomain: Merge branch fixes into next
pmdomain: Merge branch dt into next
Yuanshen Cao (2):
dt-bindings: power: Add Support for Allwinner A733 PCK600 Power Domain Controller
pmdomain: sunxi: Add support for A733 to Allwinner PCK600 driver
.../bindings/power/allwinner,sun20i-d1-ppu.yaml | 17 +-
.../bindings/power/mediatek,mt8196-gpufreq.yaml | 13 +
.../bindings/power/mediatek,power-controller.yaml | 1 +
.../devicetree/bindings/power/qcom,rpmpd.yaml | 2 +
drivers/pmdomain/arm/scmi_pm_domain.c | 1 +
drivers/pmdomain/bcm/bcm2835-power.c | 25 +-
drivers/pmdomain/core.c | 59 ++-
drivers/pmdomain/imx/imx93-blk-ctrl.c | 77 ++--
drivers/pmdomain/imx/scu-pd.c | 1 +
.../pmdomain/marvell/pxa1908-power-controller.c | 39 +-
drivers/pmdomain/mediatek/mt8189-pm-domains.h | 485 +++++++++++++++++++++
drivers/pmdomain/mediatek/mtk-mfg-pmdomain.c | 59 +++
drivers/pmdomain/mediatek/mtk-pm-domains.c | 44 +-
drivers/pmdomain/mediatek/mtk-pm-domains.h | 5 +
drivers/pmdomain/mediatek/mtk-scpsys.c | 10 +
drivers/pmdomain/qcom/Kconfig | 2 +-
drivers/pmdomain/qcom/cpr.c | 13 +-
drivers/pmdomain/qcom/rpmhpd.c | 58 +++
drivers/pmdomain/qcom/rpmpd.c | 2 +-
drivers/pmdomain/rockchip/pm-domains.c | 7 +-
drivers/pmdomain/sunxi/sun55i-pck600.c | 35 +-
drivers/pmdomain/ti/omap_prm.c | 1 +
drivers/pmdomain/ti/ti_sci_pm_domains.c | 5 +-
.../power/allwinner,sun60i-a733-pck-600.h | 18 +
include/dt-bindings/power/marvell,pxa1908-power.h | 1 +
include/dt-bindings/power/mediatek,mt8189-power.h | 38 ++
include/dt-bindings/power/mt7622-power.h | 1 +
include/dt-bindings/power/qcom,rpmhpd.h | 12 +
include/linux/pm_domain.h | 5 +-
29 files changed, 932 insertions(+), 104 deletions(-)
create mode 100644 drivers/pmdomain/mediatek/mt8189-pm-domains.h
create mode 100644 include/dt-bindings/power/allwinner,sun60i-a733-pck-600.h
create mode 100644 include/dt-bindings/power/mediatek,mt8189-power.h
^ permalink raw reply