public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: tip-bot for Wanpeng Li <tipbot@zytor.com>
To: linux-tip-commits@vger.kernel.org
Cc: hpa@zytor.com, linux-kernel@vger.kernel.org,
	torvalds@linux-foundation.org, mingo@kernel.org,
	fweisbec@gmail.com, peterz@infradead.org, lcapitulino@redhat.com,
	kernellwp@gmail.com, wanpeng.li@hotmail.com, tglx@linutronix.de,
	riel@redhat.com
Subject: [tip:sched/urgent] sched/cputime: Accumulate vtime on top of nsec clocksource
Date: Wed, 5 Jul 2017 03:29:40 -0700	[thread overview]
Message-ID: <tip-2a42eb9594a1480b4ead9e036e06ee1290e5fa6d@git.kernel.org> (raw)
In-Reply-To: <1498756511-11714-6-git-send-email-fweisbec@gmail.com>

Commit-ID:  2a42eb9594a1480b4ead9e036e06ee1290e5fa6d
Gitweb:     http://git.kernel.org/tip/2a42eb9594a1480b4ead9e036e06ee1290e5fa6d
Author:     Wanpeng Li <wanpeng.li@hotmail.com>
AuthorDate: Thu, 29 Jun 2017 19:15:11 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 5 Jul 2017 09:54:15 +0200

sched/cputime: Accumulate vtime on top of nsec clocksource

Currently the cputime source used by vtime is jiffies. When we cross
a context boundary and jiffies have changed since the last snapshot, the
pending cputime is accounted to the switching out context.

This system works ok if the ticks are not aligned across CPUs. If they
instead are aligned (ie: all fire at the same time) and the CPUs run in
userspace, the jiffies change is only observed on tick exit and therefore
the user cputime is accounted as system cputime. This is because the
CPU that maintains timekeeping fires its tick at the same time as the
others. It updates jiffies in the middle of the tick and the other CPUs
see that update on IRQ exit:

    CPU 0 (timekeeper)                  CPU 1
    -------------------              -------------
                      jiffies = N
    ...                              run in userspace for a jiffy
    tick entry                       tick entry (sees jiffies = N)
    set jiffies = N + 1
    tick exit                        tick exit (sees jiffies = N + 1)
                                                account 1 jiffy as stime

Fix this with using a nanosec clock source instead of jiffies. The
cputime is then accumulated and flushed everytime the pending delta
reaches a jiffy in order to mitigate the accounting overhead.

[ fweisbec: changelog, rebase on struct vtime, field renames, add delta
  on cputime readers, keep idle vtime as-is (low overhead accounting),
  harmonize clock sources. ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h  |  3 +++
 kernel/sched/cputime.c | 64 +++++++++++++++++++++++++++++++++-----------------
 2 files changed, 45 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eeff8a0..4818126 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -236,6 +236,9 @@ struct vtime {
 	seqcount_t		seqcount;
 	unsigned long long	starttime;
 	enum vtime_state	state;
+	u64			utime;
+	u64			stime;
+	u64			gtime;
 };
 
 struct sched_info {
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9ee725e..6e3ea4a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -681,18 +681,19 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static u64 vtime_delta(struct vtime *vtime)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long long clock;
 
-	if (time_before(now, (unsigned long)vtime->starttime))
+	clock = sched_clock_cpu(smp_processor_id());
+	if (clock < vtime->starttime)
 		return 0;
 
-	return jiffies_to_nsecs(now - vtime->starttime);
+	return clock - vtime->starttime;
 }
 
 static u64 get_vtime_delta(struct vtime *vtime)
 {
-	unsigned long now = READ_ONCE(jiffies);
-	u64 delta, other;
+	u64 delta = vtime_delta(vtime);
+	u64 other;
 
 	/*
 	 * Unlike tick based timing, vtime based timing never has lost
@@ -701,17 +702,31 @@ static u64 get_vtime_delta(struct vtime *vtime)
 	 * elapsed time. Limit account_other_time to prevent rounding
 	 * errors from causing elapsed vtime to go negative.
 	 */
-	delta = jiffies_to_nsecs(now - vtime->starttime);
 	other = account_other_time(delta);
 	WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
-	vtime->starttime = now;
+	vtime->starttime += delta;
 
 	return delta - other;
 }
 
-static void __vtime_account_system(struct task_struct *tsk)
+static void __vtime_account_system(struct task_struct *tsk,
+				   struct vtime *vtime)
 {
-	account_system_time(tsk, irq_count(), get_vtime_delta(&tsk->vtime));
+	vtime->stime += get_vtime_delta(vtime);
+	if (vtime->stime >= TICK_NSEC) {
+		account_system_time(tsk, irq_count(), vtime->stime);
+		vtime->stime = 0;
+	}
+}
+
+static void vtime_account_guest(struct task_struct *tsk,
+				struct vtime *vtime)
+{
+	vtime->gtime += get_vtime_delta(vtime);
+	if (vtime->gtime >= TICK_NSEC) {
+		account_guest_time(tsk, vtime->gtime);
+		vtime->gtime = 0;
+	}
 }
 
 void vtime_account_system(struct task_struct *tsk)
@@ -722,7 +737,11 @@ void vtime_account_system(struct task_struct *tsk)
 		return;
 
 	write_seqcount_begin(&vtime->seqcount);
-	__vtime_account_system(tsk);
+	/* We might have scheduled out from guest path */
+	if (current->flags & PF_VCPU)
+		vtime_account_guest(tsk, vtime);
+	else
+		__vtime_account_system(tsk, vtime);
 	write_seqcount_end(&vtime->seqcount);
 }
 
@@ -731,8 +750,7 @@ void vtime_user_enter(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	if (vtime_delta(vtime))
-		__vtime_account_system(tsk);
+	__vtime_account_system(tsk, vtime);
 	vtime->state = VTIME_USER;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -742,8 +760,11 @@ void vtime_user_exit(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	if (vtime_delta(vtime))
-		account_user_time(tsk, get_vtime_delta(vtime));
+	vtime->utime += get_vtime_delta(vtime);
+	if (vtime->utime >= TICK_NSEC) {
+		account_user_time(tsk, vtime->utime);
+		vtime->utime = 0;
+	}
 	vtime->state = VTIME_SYS;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -759,8 +780,7 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&vtime->seqcount);
-	if (vtime_delta(vtime))
-		__vtime_account_system(tsk);
+	__vtime_account_system(tsk, vtime);
 	current->flags |= PF_VCPU;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -771,7 +791,7 @@ void vtime_guest_exit(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	__vtime_account_system(tsk);
+	vtime_account_guest(tsk, vtime);
 	current->flags &= ~PF_VCPU;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -794,7 +814,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 
 	write_seqcount_begin(&vtime->seqcount);
 	vtime->state = VTIME_SYS;
-	vtime->starttime = jiffies;
+	vtime->starttime = sched_clock_cpu(smp_processor_id());
 	write_seqcount_end(&vtime->seqcount);
 }
 
@@ -806,7 +826,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	local_irq_save(flags);
 	write_seqcount_begin(&vtime->seqcount);
 	vtime->state = VTIME_SYS;
-	vtime->starttime = jiffies;
+	vtime->starttime = sched_clock_cpu(cpu);
 	write_seqcount_end(&vtime->seqcount);
 	local_irq_restore(flags);
 }
@@ -825,7 +845,7 @@ u64 task_gtime(struct task_struct *t)
 
 		gtime = t->gtime;
 		if (vtime->state == VTIME_SYS && t->flags & PF_VCPU)
-			gtime += vtime_delta(vtime);
+			gtime += vtime->gtime + vtime_delta(vtime);
 
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 
@@ -866,9 +886,9 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		 * the right place.
 		 */
 		if (vtime->state == VTIME_USER || t->flags & PF_VCPU)
-			*utime += delta;
+			*utime += vtime->utime + delta;
 		else if (vtime->state == VTIME_SYS)
-			*stime += delta;
+			*stime += vtime->stime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */

  parent reply	other threads:[~2017-07-05 10:34 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-29 17:15 [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Frederic Weisbecker
2017-06-29 17:15 ` [PATCH 1/5] vtime: Remove vtime_account_user() Frederic Weisbecker
2017-06-29 23:01   ` Rik van Riel
2017-07-05 10:28   ` [tip:sched/urgent] vtime, sched/cputime: " tip-bot for Frederic Weisbecker
2017-06-29 17:15 ` [PATCH 2/5] sched: Always set vtime_snap_whence after accounting vtime Frederic Weisbecker
2017-06-29 23:01   ` Rik van Riel
2017-07-05 10:28   ` [tip:sched/urgent] sched/cputime: Always set tsk->vtime_snap_whence " tip-bot for Frederic Weisbecker
2017-06-29 17:15 ` [PATCH 3/5] sched: Rename vtime fields Frederic Weisbecker
2017-06-29 23:02   ` Rik van Riel
2017-07-05 10:28   ` [tip:sched/urgent] sched/cputime: " tip-bot for Frederic Weisbecker
2017-06-29 17:15 ` [PATCH 4/5] sched: Move vtime task fields to their own struct Frederic Weisbecker
2017-06-29 23:05   ` Rik van Riel
2017-07-05 10:29   ` [tip:sched/urgent] sched/cputime: Move the " tip-bot for Frederic Weisbecker
2017-06-29 17:15 ` [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource Frederic Weisbecker
2017-06-29 23:27   ` Rik van Riel
2017-07-05 13:20     ` Frederic Weisbecker
2017-06-30  1:52   ` Wanpeng Li
2017-07-05 10:29   ` tip-bot for Wanpeng Li [this message]
2017-07-15  3:37   ` Levin, Alexander (Sasha Levin)
2017-07-15  5:26     ` Wanpeng Li
2017-06-30  1:41 ` [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Wanpeng Li
2017-06-30 17:32   ` Luiz Capitulino
2017-07-03 10:28 ` Thomas Gleixner
2017-07-04 16:52 ` Luiz Capitulino
2017-07-05 13:16   ` Frederic Weisbecker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=tip-2a42eb9594a1480b4ead9e036e06ee1290e5fa6d@git.kernel.org \
    --to=tipbot@zytor.com \
    --cc=fweisbec@gmail.com \
    --cc=hpa@zytor.com \
    --cc=kernellwp@gmail.com \
    --cc=lcapitulino@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tip-commits@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=wanpeng.li@hotmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox