* [PATCH 0/2] timers/nohz fixes
@ 2021-10-26 14:10 Frederic Weisbecker
  2021-10-26 14:10 ` [PATCH 1/2 RESEND] timers/nohz: Last resort update jiffies on nohz_full IRQ entry Frederic Weisbecker
  2021-10-26 14:10 ` [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full Frederic Weisbecker

From: Frederic Weisbecker @ 2021-10-26 14:10 UTC
To: Thomas Gleixner
Cc: LKML, Frederic Weisbecker, Paul E. McKenney, Peter Zijlstra,
    Hasegawa Hitomi, Mel Gorman

A bunch of fixes that don't need to be on the urgent queue since they
are not regression fixes. But they are still fixes...

Frederic Weisbecker (2):
  timers/nohz: Last resort update jiffies on nohz_full IRQ entry
  sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full

 include/linux/sched/cputime.h |  5 +++--
 kernel/sched/cputime.c        | 12 +++++++++---
 kernel/softirq.c              |  3 ++-
 kernel/time/tick-sched.c      |  7 +++++++
 4 files changed, 21 insertions(+), 6 deletions(-)

-- 
2.25.1
* [PATCH 1/2 RESEND] timers/nohz: Last resort update jiffies on nohz_full IRQ entry
  2021-10-26 14:10 [PATCH 0/2] timers/nohz fixes Frederic Weisbecker
@ 2021-10-26 14:10 ` Frederic Weisbecker
  2021-12-02 14:12   ` [tip: timers/urgent] " tip-bot2 for Frederic Weisbecker

From: Frederic Weisbecker @ 2021-10-26 14:10 UTC
To: Thomas Gleixner
Cc: LKML, Frederic Weisbecker, Paul E. McKenney, Peter Zijlstra,
    Hasegawa Hitomi

When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
is guaranteed to stay online and to never stop its tick.

Meanwhile, in some rare cases, the dedicated timekeeper may be running
with interrupts disabled for a while, such as in stop_machine.

If jiffies stop being updated, a nohz_full CPU may end up endlessly
programming the next tick in the past, taking the last jiffies update
monotonic timestamp as a stale base, resulting in a tick storm.

Here is a scenario where it matters:

0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.

1) A stop machine callback is queued to execute somewhere.

2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
   MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
   can still take IRQs.

3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.

4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
   last_jiffies_update as a base. But last_jiffies_update hasn't been
   updated for 2 jiffies since the timekeeper has interrupts disabled.

5) clockevents_program_event(), which relies on ktime_get(), observes
   that the expiration is in the past and therefore programs the min
   delta event on the clock.

6) The tick fires immediately, goto 3)

7) Tick storm: the nohz_full CPU is drowned and takes ages to reach
   MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.

Solve this by unconditionally updating jiffies if the value is stale on
nohz_full IRQ entry. IRQs and other disturbances are expected to be rare
enough on nohz_full for the unconditional call to ktime_get() not to
matter.

Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/softirq.c         | 3 ++-
 kernel/time/tick-sched.c | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 322b65d45676..41f470929e99 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -595,7 +595,8 @@ void irq_enter_rcu(void)
 {
 	__irq_enter_raw();
 
-	if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
+	if (tick_nohz_full_cpu(smp_processor_id()) ||
+	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
 		tick_irq_enter();
 
 	account_hardirq_enter(current);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6bffe5af8cb1..17a283ce2b20 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1375,6 +1375,13 @@ static inline void tick_nohz_irq_enter(void)
 	now = ktime_get();
 	if (ts->idle_active)
 		tick_nohz_stop_idle(ts, now);
+	/*
+	 * If all CPUs are idle, we may need to update a stale jiffies value.
+	 * Note nohz_full is a special case: a timekeeper is guaranteed to stay
+	 * alive but it might be busy looping with interrupts disabled in some
+	 * rare case (typically stop machine). So we must make sure we have a
+	 * last resort.
+	 */
 	if (ts->tick_stopped)
 		tick_nohz_update_jiffies(now);
 }
-- 
2.25.1
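The feedback loop in steps 3) to 6) can be modeled outside the kernel.
The following is a minimal standalone C simulation of the scenario, not
kernel code: TICK_NSEC and MIN_DELTA here are illustrative stand-ins for
the real jiffy period and the clockevent's min_delta_ns, and
last_jiffies_update is a plain variable frozen two jiffies in the past,
matching step 4) above.

/*
 * Standalone model of the tick storm described above, not kernel code.
 * "now" advances while last_jiffies_update stays frozen (the timekeeper
 * has interrupts disabled), so every expiry computed from the stale base
 * lands in the past and gets clamped to the clockevent's minimum delta.
 */
#include <stdio.h>
#include <stdint.h>

#define TICK_NSEC 1000000ULL /* one jiffy = 1ms (HZ=1000), model value */
#define MIN_DELTA 1000ULL    /* stand-in for clockevent min_delta_ns */

int main(void)
{
	uint64_t now = 10 * TICK_NSEC;                /* ktime_get() stand-in */
	uint64_t last_jiffies_update = 8 * TICK_NSEC; /* stale for 2 jiffies */
	int fires = 0;

	for (int i = 0; i < 5; i++) {
		/* Step 4: tick programmed one jiffy after the stale base */
		uint64_t expires = last_jiffies_update + TICK_NSEC;

		/* Step 5: clockevents_program_event() sees a past expiry */
		if (expires <= now) {
			now += MIN_DELTA; /* min delta event: fires at once */
			fires++;
		}
		/* Step 6: last_jiffies_update never moves, so it repeats */
	}
	printf("%d ticks fired within %llu ns: a storm\n",
	       fires, (unsigned long long)(now - 10 * TICK_NSEC));
	return 0;
}

The fix breaks this loop at step 4): with the irq_enter_rcu() change,
tick_irq_enter() also runs on nohz_full CPUs, so tick_nohz_update_jiffies()
refreshes the stale base before the next tick is programmed.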
* [tip: timers/urgent] timers/nohz: Last resort update jiffies on nohz_full IRQ entry
  2021-10-26 14:10 ` [PATCH 1/2 RESEND] timers/nohz: Last resort update jiffies on nohz_full IRQ entry Frederic Weisbecker
@ 2021-12-02 14:12   ` tip-bot2 for Frederic Weisbecker

From: tip-bot2 for Frederic Weisbecker @ 2021-12-02 14:12 UTC
To: linux-tip-commits
Cc: Paul E. McKenney, Frederic Weisbecker, Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/urgent branch of tip:

Commit-ID:     53e87e3cdc155f20c3417b689df8d2ac88d79576
Gitweb:        https://git.kernel.org/tip/53e87e3cdc155f20c3417b689df8d2ac88d79576
Author:        Frederic Weisbecker <frederic@kernel.org>
AuthorDate:    Tue, 26 Oct 2021 16:10:54 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 02 Dec 2021 15:07:22 +01:00

timers/nohz: Last resort update jiffies on nohz_full IRQ entry

When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
is guaranteed to stay online and to never stop its tick.

Meanwhile, in some rare cases, the dedicated timekeeper may be running
with interrupts disabled for a while, such as in stop_machine.

If jiffies stop being updated, a nohz_full CPU may end up endlessly
programming the next tick in the past, taking the last jiffies update
monotonic timestamp as a stale base, resulting in a tick storm.

Here is a scenario where it matters:

0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.

1) A stop machine callback is queued to execute somewhere.

2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
   MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
   can still take IRQs.

3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.

4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
   last_jiffies_update as a base. But last_jiffies_update hasn't been
   updated for 2 jiffies since the timekeeper has interrupts disabled.

5) clockevents_program_event(), which relies on ktime_get(), observes
   that the expiration is in the past and therefore programs the min
   delta event on the clock.

6) The tick fires immediately, goto 3)

7) Tick storm: the nohz_full CPU is drowned and takes ages to reach
   MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.

Solve this by unconditionally updating jiffies if the value is stale on
nohz_full IRQ entry. IRQs and other disturbances are expected to be rare
enough on nohz_full for the unconditional call to ktime_get() not to
matter.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org
---
 kernel/softirq.c         | 3 ++-
 kernel/time/tick-sched.c | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 322b65d..41f4709 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -595,7 +595,8 @@ void irq_enter_rcu(void)
 {
 	__irq_enter_raw();
 
-	if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
+	if (tick_nohz_full_cpu(smp_processor_id()) ||
+	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
 		tick_irq_enter();
 
 	account_hardirq_enter(current);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6bffe5a..17a283c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1375,6 +1375,13 @@ static inline void tick_nohz_irq_enter(void)
 	now = ktime_get();
 	if (ts->idle_active)
 		tick_nohz_stop_idle(ts, now);
+	/*
+	 * If all CPUs are idle, we may need to update a stale jiffies value.
+	 * Note nohz_full is a special case: a timekeeper is guaranteed to stay
+	 * alive but it might be busy looping with interrupts disabled in some
+	 * rare case (typically stop machine). So we must make sure we have a
+	 * last resort.
+	 */
 	if (ts->tick_stopped)
 		tick_nohz_update_jiffies(now);
 }
* [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
  2021-10-26 14:10 [PATCH 0/2] timers/nohz fixes Frederic Weisbecker
  2021-10-26 14:10 ` [PATCH 1/2 RESEND] timers/nohz: Last resort update jiffies on nohz_full IRQ entry Frederic Weisbecker
@ 2021-10-26 14:10 ` Frederic Weisbecker
  2021-10-26 17:40   ` Masayoshi Mizuma
  2021-11-10 19:30   ` Phil Auld
  2021-12-02 14:12   ` [tip: sched/urgent] " tip-bot2 for Frederic Weisbecker

From: Frederic Weisbecker @ 2021-10-26 14:10 UTC
To: Thomas Gleixner
Cc: LKML, Frederic Weisbecker, Peter Zijlstra, Hasegawa Hitomi, Mel Gorman

getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
than the actual time.

task_cputime_adjusted() snapshots utime and stime and then adjusts their
sum to match the scheduler-maintained cputime.sum_exec_runtime.
Unfortunately, in nohz_full, sum_exec_runtime is only updated once per
second in the worst case, causing a discrepancy against utime and stime,
which can be updated anytime by the reader using vtime.

To fix this situation, perform an update of cputime.sum_exec_runtime
when the cputime snapshot reports the task as actually running while
the tick is disabled. The related overhead is then contained within the
relevant situations.

Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
---
 include/linux/sched/cputime.h |  5 +++--
 kernel/sched/cputime.c        | 12 +++++++++---
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 6c9f19a33865..ce3c58286062 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -18,15 +18,16 @@
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void task_cputime(struct task_struct *t,
+extern bool task_cputime(struct task_struct *t,
 			 u64 *utime, u64 *stime);
 extern u64 task_gtime(struct task_struct *t);
 #else
-static inline void task_cputime(struct task_struct *t,
+static inline bool task_cputime(struct task_struct *t,
 				u64 *utime, u64 *stime)
 {
 	*utime = t->utime;
 	*stime = t->stime;
+	return false;
 }
 
 static inline u64 task_gtime(struct task_struct *t)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 872e481d5098..9392aea1804e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -615,7 +615,8 @@ void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 		.sum_exec_runtime = p->se.sum_exec_runtime,
 	};
 
-	task_cputime(p, &cputime.utime, &cputime.stime);
+	if (task_cputime(p, &cputime.utime, &cputime.stime))
+		cputime.sum_exec_runtime = task_sched_runtime(p);
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 EXPORT_SYMBOL_GPL(task_cputime_adjusted);
@@ -828,19 +829,21 @@ u64 task_gtime(struct task_struct *t)
  * add up the pending nohz execution time since the last
  * cputime snapshot.
  */
-void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
+bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 {
 	struct vtime *vtime = &t->vtime;
 	unsigned int seq;
 	u64 delta;
+	int ret;
 
 	if (!vtime_accounting_enabled()) {
 		*utime = t->utime;
 		*stime = t->stime;
-		return;
+		return false;
 	}
 
 	do {
+		ret = false;
 		seq = read_seqcount_begin(&vtime->seqcount);
 
 		*utime = t->utime;
@@ -850,6 +853,7 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		if (vtime->state < VTIME_SYS)
 			continue;
 
+		ret = true;
 		delta = vtime_delta(vtime);
 
 		/*
@@ -861,6 +865,8 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		else
 			*utime += vtime->utime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
+
+	return ret;
 }
 
 static int vtime_state_fetch(struct vtime *vtime, int cpu)
-- 
2.25.1
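The discrepancy can be observed from userspace with a sketch along these
lines. It assumes a kernel booted with nohz_full=1 and that CPU 1 is the
nohz_full CPU (both are assumptions of this example, not part of the
patch). CLOCK_THREAD_CPUTIME_ID is sampled alongside because it is
derived from the scheduler runtime, which stays accurate; before this
fix, the getrusage() utime+stime sum can lag it by up to a second while
the thread spins ticklessly.

/*
 * Sketch of a reproducer for the getrusage(RUSAGE_THREAD) shortfall.
 * Assumptions: kernel booted with nohz_full=1, CPU 1 exists and is the
 * nohz_full CPU, and nothing else disturbs it while we spin.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

static double tv_to_s(struct timeval tv)
{
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(1, &set); /* assumed nohz_full CPU */
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");

	/* Spin ~2s so the tick is stopped and vtime accrues meanwhile */
	struct timespec start, now;
	clock_gettime(CLOCK_MONOTONIC, &start);
	do
		clock_gettime(CLOCK_MONOTONIC, &now);
	while (now.tv_sec - start.tv_sec < 2);

	struct timespec sched_cpu; /* scheduler-clock based, always fresh */
	struct rusage ru;          /* vtime-based utime/stime split */

	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &sched_cpu);
	getrusage(RUSAGE_THREAD, &ru);

	/* Before the fix, the second number can lag the first by up to 1s */
	printf("thread cputime clock: %.6f s\n",
	       sched_cpu.tv_sec + sched_cpu.tv_nsec / 1e9);
	printf("getrusage u+s time:   %.6f s\n",
	       tv_to_s(ru.ru_utime) + tv_to_s(ru.ru_stime));
	return 0;
}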
* Re: [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
  2021-10-26 14:10 ` [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full Frederic Weisbecker
@ 2021-10-26 17:40   ` Masayoshi Mizuma

From: Masayoshi Mizuma @ 2021-10-26 17:40 UTC
To: Frederic Weisbecker
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Hasegawa Hitomi, Mel Gorman

On Tue, Oct 26, 2021 at 04:10:55PM +0200, Frederic Weisbecker wrote:
> getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
> than the actual time.
>
> task_cputime_adjusted() snapshots utime and stime and then adjusts their
> sum to match the scheduler-maintained cputime.sum_exec_runtime.
> Unfortunately, in nohz_full, sum_exec_runtime is only updated once per
> second in the worst case, causing a discrepancy against utime and stime,
> which can be updated anytime by the reader using vtime.
>
> To fix this situation, perform an update of cputime.sum_exec_runtime
> when the cputime snapshot reports the task as actually running while
> the tick is disabled. The related overhead is then contained within the
> relevant situations.
>
> Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>

Thank you for this patch. getrusage(RUSAGE_THREAD) with nohz_full works
well! Please feel free to add:

Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Thanks,
Masa
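The mechanism behind the shortfall tested above: cputime_adjust()
rescales the freshly sampled utime and stime so that their sum equals
rtime, i.e. cputime.sum_exec_runtime. The following is a simplified
model of that rescaling, with made-up numbers; the real cputime_adjust()
also enforces monotonicity against prev_cputime and uses overflow-safe
division, both omitted here.

/*
 * Simplified model of cputime_adjust()'s rescaling, not kernel code.
 * With a stale rtime (sum_exec_runtime updated up to 1s ago on a
 * nohz_full CPU), fresh vtime samples get scaled down: the reported bug.
 */
#include <stdio.h>
#include <stdint.h>

static void adjust(uint64_t utime, uint64_t stime, uint64_t rtime,
		   uint64_t *ut, uint64_t *st)
{
	uint64_t total = utime + stime;

	/* Scale stime by rtime/total; utime gets the remainder */
	*st = total ? stime * rtime / total : rtime;
	*ut = rtime - *st;
}

int main(void)
{
	/* vtime samples say 3s user + 1s system... */
	uint64_t utime = 3000000000ULL, stime = 1000000000ULL;
	uint64_t ut, st;

	/* ...but sum_exec_runtime was last updated at 3.2s: sum shrinks */
	adjust(utime, stime, 3200000000ULL, &ut, &st);
	printf("stale rtime: ut+st = %llu ns (short of the real 4s)\n",
	       (unsigned long long)(ut + st));

	/* After the fix, rtime is refreshed via task_sched_runtime() */
	adjust(utime, stime, 4000000000ULL, &ut, &st);
	printf("fresh rtime: ut+st = %llu ns\n",
	       (unsigned long long)(ut + st));
	return 0;
}

With the fix, task_cputime() returns true whenever vtime has accrued on
a tickless CPU, and task_cputime_adjusted() then refreshes rtime via
task_sched_runtime() before adjusting, so the scaled sum tracks the
actual runtime.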
* Re: [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
  2021-10-26 14:10 ` [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full Frederic Weisbecker
  2021-10-26 17:40   ` Masayoshi Mizuma
@ 2021-11-10 19:30   ` Phil Auld

From: Phil Auld @ 2021-11-10 19:30 UTC
To: Frederic Weisbecker
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Hasegawa Hitomi, Mel Gorman

Hi,

On Tue, Oct 26, 2021 at 04:10:55PM +0200 Frederic Weisbecker wrote:
> getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
> than the actual time.
>
> task_cputime_adjusted() snapshots utime and stime and then adjusts their
> sum to match the scheduler-maintained cputime.sum_exec_runtime.
> Unfortunately, in nohz_full, sum_exec_runtime is only updated once per
> second in the worst case, causing a discrepancy against utime and stime,
> which can be updated anytime by the reader using vtime.
>
> To fix this situation, perform an update of cputime.sum_exec_runtime
> when the cputime snapshot reports the task as actually running while
> the tick is disabled. The related overhead is then contained within the
> relevant situations.
>
> Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>

Could someone please pick this (or, rather, these) up?

Acked-by: Phil Auld <pauld@redhat.com>

Thanks!

Phil
* [tip: sched/urgent] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
  2021-10-26 14:10 ` [PATCH 2/2] sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full Frederic Weisbecker
  2021-10-26 17:40   ` Masayoshi Mizuma
  2021-11-10 19:30   ` Phil Auld
@ 2021-12-02 14:12   ` tip-bot2 for Frederic Weisbecker

From: tip-bot2 for Frederic Weisbecker @ 2021-12-02 14:12 UTC
To: linux-tip-commits
Cc: Hasegawa Hitomi, Frederic Weisbecker, Thomas Gleixner,
    Masayoshi Mizuma, Phil Auld, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     e7f2be115f0746b969c0df14c0d182f65f005ca5
Gitweb:        https://git.kernel.org/tip/e7f2be115f0746b969c0df14c0d182f65f005ca5
Author:        Frederic Weisbecker <frederic@kernel.org>
AuthorDate:    Tue, 26 Oct 2021 16:10:55 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 02 Dec 2021 15:08:22 +01:00

sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full

getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
than the actual time.

task_cputime_adjusted() snapshots utime and stime and then adjusts their
sum to match the scheduler-maintained cputime.sum_exec_runtime.
Unfortunately, in nohz_full, sum_exec_runtime is only updated once per
second in the worst case, causing a discrepancy against utime and stime,
which can be updated anytime by the reader using vtime.

To fix this situation, perform an update of cputime.sum_exec_runtime
when the cputime snapshot reports the task as actually running while
the tick is disabled. The related overhead is then contained within the
relevant situations.

Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20211026141055.57358-3-frederic@kernel.org
---
 include/linux/sched/cputime.h |  5 +++--
 kernel/sched/cputime.c        | 12 +++++++++---
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 6c9f19a..ce3c582 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -18,15 +18,16 @@
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void task_cputime(struct task_struct *t,
+extern bool task_cputime(struct task_struct *t,
 			 u64 *utime, u64 *stime);
 extern u64 task_gtime(struct task_struct *t);
 #else
-static inline void task_cputime(struct task_struct *t,
+static inline bool task_cputime(struct task_struct *t,
 				u64 *utime, u64 *stime)
 {
 	*utime = t->utime;
 	*stime = t->stime;
+	return false;
 }
 
 static inline u64 task_gtime(struct task_struct *t)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 872e481..9392aea 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -615,7 +615,8 @@ void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 		.sum_exec_runtime = p->se.sum_exec_runtime,
 	};
 
-	task_cputime(p, &cputime.utime, &cputime.stime);
+	if (task_cputime(p, &cputime.utime, &cputime.stime))
+		cputime.sum_exec_runtime = task_sched_runtime(p);
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 EXPORT_SYMBOL_GPL(task_cputime_adjusted);
@@ -828,19 +829,21 @@ u64 task_gtime(struct task_struct *t)
  * add up the pending nohz execution time since the last
  * cputime snapshot.
  */
-void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
+bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 {
 	struct vtime *vtime = &t->vtime;
 	unsigned int seq;
 	u64 delta;
+	int ret;
 
 	if (!vtime_accounting_enabled()) {
 		*utime = t->utime;
 		*stime = t->stime;
-		return;
+		return false;
 	}
 
 	do {
+		ret = false;
 		seq = read_seqcount_begin(&vtime->seqcount);
 
 		*utime = t->utime;
@@ -850,6 +853,7 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		if (vtime->state < VTIME_SYS)
 			continue;
 
+		ret = true;
 		delta = vtime_delta(vtime);
 
 		/*
@@ -861,6 +865,8 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		else
 			*utime += vtime->utime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
+
+	return ret;
 }
 
 static int vtime_state_fetch(struct vtime *vtime, int cpu)