From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id EFDBE1D61B1
	for <linux-kernel@vger.kernel.org>; Thu, 19 Dec 2024 21:38:30 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1734644312; cv=none; b=nE5wwLFE6Va0ssZO5fbK27j+xeHvVsvZiF87I5zAbnGH5wSQKJrJux/B0m2ks/1ODGXVUiqrr9kXkC/XqD+j6Q8vcarYKaE8TTbECXDZQ7wE6108zbzznM7j3cQI7Ov/Nf+zEpGNq8nEAPQOmV/FXgyC+xtwYdR1jJCABJJAguw=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1734644312; c=relaxed/simple;
	bh=D8sbOHXFOY2EN0fSHJyCUVCgjpQluKQLk2M4WSmQBeE=;
	h=From:To:Subject:In-Reply-To:References:Date:Message-ID:
	 MIME-Version:Content-Type; b=IhMfw11O0mmSFxr/fl73gvqQdeivjWThoYJHPRQBsCz8sUZ2sAAivfc72Tk2AIl1h5EXCG8RFyZyus4p3DHQ4L3QyOWXOwbwxoMXRAKrC7l5gdteV6DQPb/CB8mBJRnYkgqcx5nh9Q/8Fo0lVH7EyFFCWsiFMVOamQbH2dnIop4=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=om0xomOr; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=Qv6PgzGk; arc=none smtp.client-ip=193.142.43.55
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="om0xomOr";
	dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="Qv6PgzGk"
From: Thomas Gleixner <tglx@linutronix.de>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020; t=1734644308;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=FddzshwyoO0fybx3gFO4DRoxp50NsYlwaVQEVJz9Zyk=;
	b=om0xomOrkjrlzYMkwmhOBriXOm3WR4J2UfNDndzO70nG2AvGpQWjAS3fG3JnszfT7och1P
	MPfXpcJjMU3YvHBp4jW7JFX9ekJJqIa6cfhx6JWnv8M3utDpvRw2ePrQJo/fKZu4l7Vyoh
	Tt5r7X01LfRNLwEq5rttrmZzWADmzmNitE4kqIv0SVU3NiKUAYEpvwUujfzQTAsOF6+hZl
	0XP5HIGc8f4ifMkNuY0BSTslAAXKLZnsFKlRxQZm8yHGXQ1KXPgeJ3xMkXfQikaUP/MrTT
	gmB3A0P8ISnltNCGPrOgzryvWEuMM6kYxuqF5dKsomI5iU3Tb6vpcZsrD19IeQ==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020e; t=1734644308;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=FddzshwyoO0fybx3gFO4DRoxp50NsYlwaVQEVJz9Zyk=;
	b=Qv6PgzGkuIR5HuVtJD6RiHSQ76lu96lv0i0ooM3xxLsCpTHPApBajI0ZnQBBN74AGskEC6
	XRZcjFf8AHu+zSCg==
To: Costa Shulyupin <costa.shul@redhat.com>, Peter Zijlstra
 <peterz@infradead.org>, Yury Norov <yury.norov@gmail.com>, Andrew Morton
 <akpm@linux-foundation.org>, Costa Shulyupin <costa.shul@redhat.com>,
 Valentin Schneider <vschneid@redhat.com>, Frederic Weisbecker
 <frederic@kernel.org>, Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
 linux-kernel@vger.kernel.org, Waiman Long <longman@redhat.com>,
 x86@kernel.org
Subject: Re: [RFC PATCH v1] stop_machine: Add stop_housekeeping_cpuslocked()
In-Reply-To: <20241218171531.2217275-1-costa.shul@redhat.com>
References: <20241218171531.2217275-1-costa.shul@redhat.com>
Date: Thu, 19 Dec 2024 22:38:28 +0100
Message-ID: <87frmj73qj.ffs@tglx>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain

Costa!

On Wed, Dec 18 2024 at 19:15, Costa Shulyupin wrote:
> CPU hotplug interferes with CPU isolation and introduces latency to
> real-time tasks.
>
> The test:
>
>   rtla timerlat hist -c 1 -a 500 &
>   echo 0 > /sys/devices/system/cpu/cpu2/online
>
> The RTLA tool reveals the following blocking thread stack trace:
>
>                -> multi_cpu_stop
>                -> cpu_stopper_thread
>                -> smpboot_thread_fn
>
> This happens because multi_cpu_stop() disables interrupts for EACH online
> CPU since takedown_cpu() indirectly invokes take_cpu_down() through
> stop_machine_cpuslocked(). I'm omitting the detailed description of the
> call chain.

TLDR: It's well known that taking a CPU offline is done via
      stop_machine(), which requires all online CPUs to spin in the
      stopper thread.

> Proposal: Limit the stop operation to housekeeping CPUs.

You proposed other non-working solutions before. How is that any
different?

Especially after Peter told you how to proceed:

  https://lore.kernel.org/all/20241209095730.GG21636@noisy.programming.kicks-ass.net

He said after you tinkered with stop_one_cpu():

  "Audit the full cpu hotplug stack and determine who all relies on this
   'implicit' serialization, then proceed to provide alternative
   solutions for these sites. Then, at the very end move to
   stop_one_cpu()."

Where is that audit or the argument why you don't need the audit because
you now restrict stop_machine() to the housekeeping CPUs?

> take_cpu_down() invokes with cpuhp_invoke_callback_range_nofail:
>  - tick_cpu_dying()
>  - hrtimers_cpu_dying()
>  - smpcfd_dying_cpu()
>  - x86_pmu_dying_cpu()
>  - rcutree_dying_cpu()
>  - sched_cpu_dying()
>  - cache_ap_offline()

That's what happens to be called with your kernel config on your test
system. What about the other 50+ CPU hotplug states in that range, which
are .config or architecture dependent?

> Which synchronizations do these functions require instead of
> stop_machine?

So you propose a change without knowing what the consequences are and
expect us to do the audit (aka. your homework) and provide it to you on
a silver tablet?

Actually you are asking the wrong question. Peter told you to determine
which other parts of the kernel rely on that implicit serialization for
a reason:

  The callbacks on the unplugged CPU are not the main problem, but still
  they have to be audited one by one.

  The real problem is the state change caused by these callbacks, which
  is visible to the rest of the system. Which consequences has such a
  state change, if it is not serialized via stop_machine().

  And that's what Peter asked you to audit and provide, no?

> Passed standard regression tests:
>  - https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6042
>  - https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-10/-/merge_requests/89
>
> Passes primitive CPU hotplug test without warnings.

Q: What is the value of this statement?

A: Zero, as the change comes with zero analysis and the random centos
   regression tests based on primitive CPU hotplug tests are as useful
   as running hello_world().

> What tests and test suites do you recommend?

You are not really asking me to explain the usage of git grep and gurgle
search?

Let me give you a hint.

  You now propose to only stop the housekeeping CPUs because (my
  assumption) you assume that all isolated CPUs are happily computing
  along in user space or in a VM, which is equivalent from the kernel
  POV.

Q: How is that assumption justified?

A: Not at all. [1]

   There is no guarantee that isolated CPUs (user/vm) are not inside the
   kernel, are not coming back to the kernel while the unplug is in
   progress or come back to the kernel post unplug and observe stale
   state.

That's not something which can be answered by random test suites. That's
something which has to be audited and argued for correctness, which is
what you have been asked to provide, right?

The test suites are there to validate correctness and to eventually find
the odd corner cases you missed in your analysis and which were missed
by people reviewing it. They cannot replace upfront analysis as their
test coverage is not complete.

Your approach to solve this problem seems to be to throw random
sh^Wideas at the wall and see what sticks. That's not how it works.

You want this "feature", so it's your task to do the homework, i.e. the
analysis you have been asked for, and to argue why your changes are
correct under all circumstances and not only under a set of your made up
assumptions. Even if these assumptions match the reality of your use
case, you still have to provide the argument and the prove that there is
no other use case, which makes your assumptions moot.

For illustration let me go back to [1] above. Assume there are two
housekeeping CPUs and two isolated CPUs.

HKCPU0		HKCPU1		ISOLCPU0	ISOLCPU1

dont'care       don't care      user space      don't care      // context

unplug(ISOLCPU1)                
                                                CPU hotplug thread
                                                invokes all thread
                                                context callbacks

takedown_cpu()
  stop_machine(HKCPUS)

                stopper_thread()                stopper_thread()
  stopper_thread()				  local_irq_disable();
                                                  // All HK CPUs arrived
                                                  invoke_dying_callbacks()

                                 syscall/vmexit()
                                  ....
                                  smp_function_call(cpu_online_mask, wait=true)
                                    ....
                                    wait_for_completion()

                                                  set_cpu_offline(false);

ISOLCPU0 now waits forever because ISOLCPU1 cannot process the IPI as it
has interrupts disabled already and had not yet removed itself in the
CPU online mask wehn ISOLCPU0 sent the IPI.

That's one simple example which makes your assumption completely moot as
you cannot make any unconditionally valid assumptions about user space
or VM behaviour. There are tons of other examples, but that's not the
maintainers/reviewers job to explain them to you.

It's your job to explain to the maintainers/reviewers why these problems
do not exist according to your fact based analysis.

Please come back when you can provide such fact based analysis and
spare us the next random "brilliant" idea thrown at the wall.

Thanks,

        tglx