From: Frederic Weisbecker <frederic@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Zucheng Zheng <zhengzucheng@huawei.com>,
mingo@redhat.com, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
bristot@redhat.com, vschneid@redhat.com, hucool.lihua@huawei.com,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH -next] sched/cputime: Fix the time backward issue about /proc/stat
Date: Wed, 28 Sep 2022 14:11:34 +0200 [thread overview]
Message-ID: <20220928121134.GA233658@lothringen> (raw)
In-Reply-To: <YzQB8afi2rCPvuC1@hirez.programming.kicks-ass.net>
On Wed, Sep 28, 2022 at 10:12:33AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 28, 2022 at 11:34:02AM +0800, Zucheng Zheng wrote:
> > From: Zheng Zucheng <zhengzucheng@huawei.com>
> >
> > The cputime of cpuN read from /proc/stat has an issue of cputime descent.
> > For example, the phenomenon of cputime descent of user is as followed:
> >
> > The value read first is 319, and the value read again is 318. As follows:
> > first:
> > cat /proc/stat | grep cpu1
> > cpu1 319 0 496 41665 0 0 0 0 0 0
> > again:
> > cat /proc/stat | grep cpu1
> > cpu1 318 0 497 41674 0 0 0 0 0 0
> >
> > The value read from /proc/stat should be monotonically increasing. Otherwise
> > user may get incorrect CPU usage.
> >
> > The root cause of this problem is that, in the implementation of
> > kcpustat_cpu_fetch_vtime, vtime->utime + delta is added to the stack variable
> > cpustat instantaneously. If the task is switched between two times, the value
> > added to cpustat for the second time may be smaller than that for the first time.
> >
> > CPU0 CPU1
> > First:
> > show_stat()
> > ->kcpustat_cpu_fetch()
> > ->kcpustat_cpu_fetch_vtime()
> > ->cpustat[CPUTIME_USER] = kcpustat_cpu(cpu) + vtime->utime + delta rq->curr is task A
> > A switch to B,and A->vtime->utime is less than 1 tick
> > Then:
> > show_stat()
> > ->kcpustat_cpu_fetch()
> > ->kcpustat_cpu_fetch_vtime()
> > ->cpustat[CPUTIME_USER] = kcpustat_cpu(cpu) + vtime->utime + delta; rq->curr is task B
>
> You're still not explaining where the time gets lost. And the patch is a
> horrible band-aid.
>
> What I think you're saying; after staring at this for a while, is that:
>
> vtime_task_switch_generic()
> __vtime_account_kernel(prev, vtime)
> vtime_account_{guest,system}(tsk, vtime)
> vtime->*time += get_vtime_delta()
> if (vtime->*time >= TICK_NSEC)
> account_*_time()
> account_system_index_time()
> task_group_account_field()
> __this_cpu_add(kernel_cpustat.cpustat[index], tmp); <---- here
>
> is not folding time into kernel_cpustat when the task vtime isn't at
> least a tick's worth. And then when we switch to another task, we leak
> time.
Looks right. Last time the patch was posted I misunderstood the issue.
>
> There's another problem here, vtime_task_switch_generic() should use a
> single call to sched_clock() to compute the old vtime_delta and set the
> new vtime->starttime, otherwise there's a time hole there as well.
Right, but does it really matter? It's just a few nanosecs ignored
between two tasks switching.
>
> This is all quite the maze and it really wants cleaning up, not be made
> worse.
>
> So I think you want to do two things:
>
> - pull kernel_cpustat updates out of task_group_account_field()
> and put them into vtime_task_switch_generic() to be purely
> vtime->starttime based.
So you want to force the update on all context switches? We used that TICK_NSEC
limit before updating in order to lower some overhead.
There is also user <-> kernel involved.
How about handling that from the read side? (below untested)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 78a233d43757..f0f1af337e49 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -814,9 +814,9 @@ u64 task_gtime(struct task_struct *t)
do {
seq = read_seqcount_begin(&vtime->seqcount);
- gtime = t->gtime;
+ gtime = t->gtime + vtime->gtime;
if (vtime->state == VTIME_GUEST)
- gtime += vtime->gtime + vtime_delta(vtime);
+ gtime += vtime_delta(vtime);
} while (read_seqcount_retry(&vtime->seqcount, seq));
@@ -845,8 +845,8 @@ bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
ret = false;
seq = read_seqcount_begin(&vtime->seqcount);
- *utime = t->utime;
- *stime = t->stime;
+ *utime = t->utime + vtime->utime;
+ *stime = t->stime + vtime->stime;
/* Task is sleeping or idle, nothing to add */
if (vtime->state < VTIME_SYS)
@@ -860,9 +860,9 @@ bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
* add pending nohz time to the right place.
*/
if (vtime->state == VTIME_SYS)
- *stime += vtime->stime + delta;
+ *stime += delta;
else
- *utime += vtime->utime + delta;
+ *utime += delta;
} while (read_seqcount_retry(&vtime->seqcount, seq));
return ret;
@@ -896,11 +896,22 @@ static int vtime_state_fetch(struct vtime *vtime, int cpu)
static u64 kcpustat_user_vtime(struct vtime *vtime)
{
- if (vtime->state == VTIME_USER)
- return vtime->utime + vtime_delta(vtime);
- else if (vtime->state == VTIME_GUEST)
- return vtime->gtime + vtime_delta(vtime);
- return 0;
+ u64 delta = vtime->utime + vtime->gtime;
+
+ if (vtime->state == VTIME_USER || vtime->state == VTIME_GUEST)
+ delta += vtime_delta(vtime);
+
+ return delta;
+}
+
+static u64 kcpustat_guest_vtime(struct vtime *vtime)
+{
+ u64 delta = vtime->gtime;
+
+ if (vtime->state == VTIME_GUEST)
+ delta += vtime_delta(vtime);
+
+ return delta;
}
static int kcpustat_field_vtime(u64 *cpustat,
@@ -931,8 +942,9 @@ static int kcpustat_field_vtime(u64 *cpustat,
*/
switch (usage) {
case CPUTIME_SYSTEM:
+ *val += vtime->stime;
if (state == VTIME_SYS)
- *val += vtime->stime + vtime_delta(vtime);
+ *val += vtime_delta(vtime);
break;
case CPUTIME_USER:
if (task_nice(tsk) <= 0)
@@ -943,12 +955,12 @@ static int kcpustat_field_vtime(u64 *cpustat,
*val += kcpustat_user_vtime(vtime);
break;
case CPUTIME_GUEST:
- if (state == VTIME_GUEST && task_nice(tsk) <= 0)
- *val += vtime->gtime + vtime_delta(vtime);
+ if (task_nice(tsk) <= 0)
+ *val += kcpustat_guest_vtime(vtime);
break;
case CPUTIME_GUEST_NICE:
- if (state == VTIME_GUEST && task_nice(tsk) > 0)
- *val += vtime->gtime + vtime_delta(vtime);
+ if (task_nice(tsk) > 0)
+ *val += kcpustat_guest_vtime(vtime);
break;
default:
break;
@@ -1013,6 +1025,15 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
*dst = *src;
cpustat = dst->cpustat;
+ cpustat[CPUTIME_SYSTEM] += vtime->stime;
+ if (task_nice(tsk) > 0) {
+ cpustat[CPUTIME_NICE] += vtime->utime + vtime->gtime;
+ cpustat[CPUTIME_GUEST_NICE] += vtime->gtime;
+ } else {
+ cpustat[CPUTIME_USER] += vtime->utime + vtime->gtime;
+ cpustat[CPUTIME_GUEST] += vtime->gtime;
+ }
+
/* Task is sleeping, dead or idle, nothing to add */
if (state < VTIME_SYS)
continue;
@@ -1024,20 +1045,20 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst,
* add pending nohz time to the right place.
*/
if (state == VTIME_SYS) {
- cpustat[CPUTIME_SYSTEM] += vtime->stime + delta;
+ cpustat[CPUTIME_SYSTEM] += delta;
} else if (state == VTIME_USER) {
if (task_nice(tsk) > 0)
- cpustat[CPUTIME_NICE] += vtime->utime + delta;
+ cpustat[CPUTIME_NICE] += delta;
else
- cpustat[CPUTIME_USER] += vtime->utime + delta;
+ cpustat[CPUTIME_USER] += delta;
} else {
WARN_ON_ONCE(state != VTIME_GUEST);
if (task_nice(tsk) > 0) {
- cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta;
- cpustat[CPUTIME_NICE] += vtime->gtime + delta;
+ cpustat[CPUTIME_GUEST_NICE] += delta;
+ cpustat[CPUTIME_NICE] += delta;
} else {
- cpustat[CPUTIME_GUEST] += vtime->gtime + delta;
- cpustat[CPUTIME_USER] += vtime->gtime + delta;
+ cpustat[CPUTIME_GUEST] += delta;
+ cpustat[CPUTIME_USER] += delta;
}
}
} while (read_seqcount_retry(&vtime->seqcount, seq));
next prev parent reply other threads:[~2022-09-28 12:11 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-28 3:34 [PATCH -next] sched/cputime: Fix the time backward issue about /proc/stat Zucheng Zheng
2022-09-28 8:12 ` Peter Zijlstra
2022-09-28 12:11 ` Frederic Weisbecker [this message]
2022-09-30 2:43 ` zhengzucheng
2022-09-30 12:16 ` Frederic Weisbecker
2022-10-09 2:28 ` zhengzucheng
2022-09-30 12:14 ` [sched/cputime] 131c995687: BUG:spinlock_trylock_failure_on_UP_on_CPU kernel test robot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220928121134.GA233658@lothringen \
--to=frederic@kernel.org \
--cc=bristot@redhat.com \
--cc=bsegall@google.com \
--cc=dietmar.eggemann@arm.com \
--cc=hucool.lihua@huawei.com \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
--cc=zhengzucheng@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox