linux-kernel.vger.kernel.org archive mirror
From: Florian Fainelli <f.fainelli@gmail.com>
To: Juri Lelli <juri.lelli@redhat.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Steven Rostedt <rostedt@goodmis.org>,
	Valentin Schneider <vschneid@redhat.com>
Cc: Doug Berger <opendmb@gmail.com>, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/topology: clear freecpu bit on detach
Date: Tue, 3 Jun 2025 09:18:36 -0700
Message-ID: <d46f7efb-db0c-44a4-9d67-bbdc91994cf9@gmail.com>
In-Reply-To: <bc44f895-3df4-41c4-bf93-d56e6ed203f3@gmail.com>

On 5/23/25 11:14, Florian Fainelli wrote:
> Moving CC list to To
> 
> On 5/2/25 06:02, Juri Lelli wrote:
>> Hi,
>>
>> On 29/04/25 10:15, Florian Fainelli wrote:
>>>
>>>
>>> On 4/22/2025 9:48 PM, Doug Berger wrote:
>>>> There is a hazard in the deadline scheduler where an offlined CPU
>>>> can have its free_cpus bit left set in the def_root_domain when
>>>> the schedutil cpufreq governor is used. This can allow a deadline
>>>> thread to be pushed to the runqueue of a powered down CPU which
>>>> breaks scheduling. The details can be found here:
>>>> https://lore.kernel.org/lkml/20250110233010.2339521-1-opendmb@gmail.com
>>>>
>>>> The free_cpus mask is expected to be cleared by set_rq_offline();
>>>> however, the hazard occurs before the root domain is made online
>>>> during CPU hotplug so that function is not invoked for the CPU
>>>> that is being made active.
>>>>
>>>> This commit works around the issue by ensuring the free_cpus bit
>>>> for a CPU is always cleared when the CPU is removed from a
>>>> root_domain. This likely makes the call of cpudl_clear_freecpu()
>>>> in rq_offline_dl() fully redundant, but I have not removed it
>>>> here because I am not certain of all flows.
>>>>
>>>> It seems likely that a better solution is possible from someone
>>>> more familiar with the scheduler implementation, but this
>>>> approach is minimally invasive from someone who is not.
>>>>
>>>> Signed-off-by: Doug Berger <opendmb@gmail.com>
>>>> ---
>>>
>>> FWIW, we were able to reproduce this with the attached hotplug.sh script,
>>> which just randomly hot plugs/unplugs CPUs (./hotplug.sh 4). Within a
>>> few hundred iterations you could see the lockup occur. It's unclear why
>>> this has not been seen by more people.
>>>
>>> Since this is not the first posting or attempt at fixing this bug [1] and
>>> we consider it to be a serious one, can this be reviewed/commented on/
>>> applied? Thanks!
>>>
>>> [1]: https://lkml.org/lkml/2025/1/14/1687
>>
>> So, going back to the initial report, the thing that makes me a bit
>> uncomfortable with the suggested change is the worry that it might be
>> plastering over a more fundamental issue. Not against it, though, and I
>> really appreciate Doug's analysis and proposed fixes!
>>
>> Doug wrote:
>>
>> "Initially, CPU0 and CPU1 are active and CPU2 and CPU3 have been
>> previously offlined so their runqueues are attached to the
>> def_root_domain.
>> 1) A hot plug is initiated on CPU2.
>> 2) The cpuhp/2 thread invokes the cpufreq governor driver during
>>     the CPUHP_AP_ONLINE_DYN step.
>> 3) The schedutil cpufreq governor creates the "sugov:2" thread to
>>     execute on CPU2 with the deadline scheduler.
>> 4) The deadline scheduler clears the free_cpus mask for CPU2 within
>>     the def_root_domain when "sugov:2" is scheduled."
>>
>> I wonder if it's OK to schedule sugov:2 on a CPU that hasn't yet reached
>> the fully online state. Peter, others, what do you think?
> 
> Peter, can you please review this patch? Thank you

Ping? Can we get to some resolution one way or another here? Thanks
-- 
Florian


Thread overview: 7+ messages
2025-04-22 19:48 [PATCH] sched/topology: clear freecpu bit on detach Doug Berger
2025-04-29  8:15 ` Florian Fainelli
2025-05-02 13:02   ` Juri Lelli
2025-05-23 18:14     ` Florian Fainelli
2025-06-03 16:18       ` Florian Fainelli [this message]
2025-06-11 20:06         ` Florian Fainelli
2025-07-25 22:33 ` Doug Berger
