From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Suleiman Souhlal <suleiman@google.com>,
Peter Zijlstra <peterz@infradead.org>,
Sasha Levin <sashal@kernel.org>,
mingo@redhat.com, juri.lelli@redhat.com,
vincent.guittot@linaro.org
Subject: [PATCH AUTOSEL 6.13 1/7] sched: Don't try to catch up excess steal time.
Date: Sun, 26 Jan 2025 09:50:04 -0500 [thread overview]
Message-ID: <20250126145011.925720-1-sashal@kernel.org> (raw)
From: Suleiman Souhlal <suleiman@google.com>
[ Upstream commit 108ad0999085df2366dd9ef437573955cb3f5586 ]
When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, in some situations, as they end up getting additional steal
time that did not actually happen.
This is because there is a window between reading the elapsed time in
update_rq_clock() and sampling the steal time in update_rq_clock_task().
If the VCPU gets preempted between those two points, any additional
steal time is accounted to the outgoing task even though the calculated
delta did not actually contain any of that "stolen" time.
When this race happens, we can end up with steal time that exceeds the
calculated delta, and the previous code would try to catch up that excess
steal time in future clock updates, which is given to the next,
incoming task, even though it did not actually have any time stolen.
This behavior is particularly bad when steal time can be very long,
which we've seen when trying to extend steal time to contain the duration
that the host was suspended [0]. When this happens, clock_task stays
frozen, during which the running task stays running for the whole
duration, since its run time doesn't increase.
However the race can happen even under normal operation.
Ideally we would read the elapsed cpu time and the steal time atomically,
to prevent this race from happening in the first place, but doing so
is non-trivial.
Since the time between those two points isn't otherwise accounted anywhere,
neither to the outgoing task nor the incoming task (because the "end of
outgoing task" and "start of incoming task" timestamps are the same),
I would argue that the right thing to do is to simply drop any excess steal
time, in order to prevent these issues.
[0] https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241118043745.1857272-1-suleiman@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
kernel/sched/core.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e5a6bf587f91..296e77380318e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -766,13 +766,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
- steal = paravirt_steal_clock(cpu_of(rq));
+ u64 prev_steal;
+
+ steal = prev_steal = paravirt_steal_clock(cpu_of(rq));
steal -= rq->prev_steal_time_rq;
if (unlikely(steal > delta))
steal = delta;
- rq->prev_steal_time_rq += steal;
+ rq->prev_steal_time_rq = prev_steal;
delta -= steal;
}
#endif
--
2.39.5
next reply other threads:[~2025-01-26 14:50 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-26 14:50 Sasha Levin [this message]
2025-01-26 14:50 ` [PATCH AUTOSEL 6.13 2/7] sched/deadline: Correctly account for allocated bandwidth during hotplug Sasha Levin
2025-01-26 14:50 ` [PATCH AUTOSEL 6.13 3/7] sched/deadline: Check bandwidth overflow earlier for hotplug Sasha Levin
2025-01-26 14:50 ` [PATCH AUTOSEL 6.13 4/7] x86: Convert unreachable() to BUG() Sasha Levin
2025-01-26 14:50 ` [PATCH AUTOSEL 6.13 5/7] locking/ww_mutex/test: Use swap() macro Sasha Levin
2025-01-26 14:50 ` [PATCH AUTOSEL 6.13 6/7] lockdep: Fix upper limit for LOCKDEP_*_BITS configs Sasha Levin
2025-01-26 14:50 ` [PATCH AUTOSEL 6.13 7/7] x86/amd_nb: Restrict init function to AMD-based systems Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250126145011.925720-1-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=stable@vger.kernel.org \
--cc=suleiman@google.com \
--cc=vincent.guittot@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).