[PATCH AUTOSEL 5.10 1/3] sched: Don't try to catch up excess steal time.

stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH AUTOSEL 5.10 1/3] sched: Don't try to catch up excess steal time.
@ 2025-01-26 14:51 Sasha Levin
  2025-01-26 14:51 ` [PATCH AUTOSEL 5.10 2/3] lockdep: Fix upper limit for LOCKDEP_*_BITS configs Sasha Levin
  2025-01-26 14:51 ` [PATCH AUTOSEL 5.10 3/3] x86/amd_nb: Restrict init function to AMD-based systems Sasha Levin
  0 siblings, 2 replies; 3+ messages in thread
From: Sasha Levin @ 2025-01-26 14:51 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Suleiman Souhlal, Peter Zijlstra, Sasha Levin, mingo, juri.lelli,
	vincent.guittot

From: Suleiman Souhlal <suleiman@google.com>

[ Upstream commit 108ad0999085df2366dd9ef437573955cb3f5586 ]

When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.

This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.

Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.

Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.

[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/sched/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7cf45d506688c..42dad8c8d6f28 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -279,13 +279,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
-		steal = paravirt_steal_clock(cpu_of(rq));
+		u64 prev_steal;
+
+		steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
 		steal -= rq->prev_steal_time_rq;

 		if (unlikely(steal > delta))
 			steal = delta;

-		rq->prev_steal_time_rq += steal;
+		rq->prev_steal_time_rq = prev_steal;
 		delta -= steal;
 	}
 #endif
-- 
2.39.5

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH AUTOSEL 5.10 2/3] lockdep: Fix upper limit for LOCKDEP_*_BITS configs
  2025-01-26 14:51 [PATCH AUTOSEL 5.10 1/3] sched: Don't try to catch up excess steal time Sasha Levin
@ 2025-01-26 14:51 ` Sasha Levin
  2025-01-26 14:51 ` [PATCH AUTOSEL 5.10 3/3] x86/amd_nb: Restrict init function to AMD-based systems Sasha Levin
  1 sibling, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2025-01-26 14:51 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Carlos Llamas, J . R . Okajima, Peter Zijlstra, Boqun Feng,
	Ingo Molnar, Waiman Long, Will Deacon, Sasha Levin

From: Carlos Llamas <cmllamas@google.com>

[ Upstream commit e638072e61726cae363d48812815197a2a0e097f ]

Lockdep has a set of configs used to determine the size of the static
arrays that it uses. However, the upper limit that was initially setup
for these configs is too high (30 bit shift). This equates to several
GiB of static memory for individual symbols. Using such high values
leads to linker errors:

  $ make defconfig
  $ ./scripts/config -e PROVE_LOCKING --set-val LOCKDEP_BITS 30
  $ make olddefconfig all
  [...]
  ld: kernel image bigger than KERNEL_IMAGE_SIZE
  ld: section .bss VMA wraps around address space

Adjust the upper limits to the maximum values that avoid these issues.
The need for anything more, likely points to a problem elsewhere. Note
that LOCKDEP_CHAINS_BITS was intentionally left out as its upper limit
had a different symptom and has already been fixed [1].

Reported-by: J. R. Okajima <hooanon05g@gmail.com>
Closes: https://lore.kernel.org/all/30795.1620913191@jrobl/ [1]
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20241024183631.643450-2-cmllamas@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 lib/Kconfig.debug | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 24ca61cf86ddc..c20729cd67b1e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1315,7 +1315,7 @@ config LOCKDEP_SMALL
 config LOCKDEP_BITS
 	int "Bitsize for MAX_LOCKDEP_ENTRIES"
 	depends on LOCKDEP && !LOCKDEP_SMALL
-	range 10 30
+	range 10 24
 	default 15
 	help
 	  Try increasing this value if you hit "BUG: MAX_LOCKDEP_ENTRIES too low!" message.
@@ -1331,7 +1331,7 @@ config LOCKDEP_CHAINS_BITS
 config LOCKDEP_STACK_TRACE_BITS
 	int "Bitsize for MAX_STACK_TRACE_ENTRIES"
 	depends on LOCKDEP && !LOCKDEP_SMALL
-	range 10 30
+	range 10 26
 	default 19
 	help
 	  Try increasing this value if you hit "BUG: MAX_STACK_TRACE_ENTRIES too low!" message.
@@ -1339,7 +1339,7 @@ config LOCKDEP_STACK_TRACE_BITS
 config LOCKDEP_STACK_TRACE_HASH_BITS
 	int "Bitsize for STACK_TRACE_HASH_SIZE"
 	depends on LOCKDEP && !LOCKDEP_SMALL
-	range 10 30
+	range 10 26
 	default 14
 	help
 	  Try increasing this value if you need large MAX_STACK_TRACE_ENTRIES.
@@ -1347,7 +1347,7 @@ config LOCKDEP_STACK_TRACE_HASH_BITS
 config LOCKDEP_CIRCULAR_QUEUE_BITS
 	int "Bitsize for elements in circular_queue struct"
 	depends on LOCKDEP
-	range 10 30
+	range 10 26
 	default 12
 	help
 	  Try increasing this value if you hit "lockdep bfs error:-1" warning due to __cq_enqueue() failure.
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH AUTOSEL 5.10 3/3] x86/amd_nb: Restrict init function to AMD-based systems
  2025-01-26 14:51 [PATCH AUTOSEL 5.10 1/3] sched: Don't try to catch up excess steal time Sasha Levin
  2025-01-26 14:51 ` [PATCH AUTOSEL 5.10 2/3] lockdep: Fix upper limit for LOCKDEP_*_BITS configs Sasha Levin
@ 2025-01-26 14:51 ` Sasha Levin
  1 sibling, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2025-01-26 14:51 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Yazen Ghannam, Borislav Petkov, Sasha Levin, tglx, mingo,
	dave.hansen, x86, Shyam-sundar.S-k, mario.limonciello,
	richard.gong

From: Yazen Ghannam <yazen.ghannam@amd.com>

[ Upstream commit bee9e840609cc67d0a7d82f22a2130fb7a0a766d ]

The code implicitly operates on AMD-based systems by matching on PCI
IDs. However, the use of these IDs is going away.

Add an explicit CPU vendor check instead of relying on PCI IDs.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-3-yazen.ghannam@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/kernel/amd_nb.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c
index 16cd56627574d..3dcaeb25ee301 100644
--- a/arch/x86/kernel/amd_nb.c
+++ b/arch/x86/kernel/amd_nb.c
@@ -536,6 +536,10 @@ static __init void fix_erratum_688(void)
 
 static __init int init_amd_nbs(void)
 {
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
+	    boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
+		return 0;
+
 	amd_cache_northbridges();
 	amd_cache_gart();
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2025-01-26 14:51 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-26 14:51 [PATCH AUTOSEL 5.10 1/3] sched: Don't try to catch up excess steal time Sasha Levin
2025-01-26 14:51 ` [PATCH AUTOSEL 5.10 2/3] lockdep: Fix upper limit for LOCKDEP_*_BITS configs Sasha Levin
2025-01-26 14:51 ` [PATCH AUTOSEL 5.10 3/3] x86/amd_nb: Restrict init function to AMD-based systems Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).