* [PATCH] proc: Add workaround for idle/iowait decreasing problem.
       [not found] ` <201304012205.DFC60784.HVOMJSFFLFtOOQ@I-love.SAKURA.ne.jp>
@ 2013-04-23 12:45 ` Tetsuo Handa
  2013-04-28  0:49   ` Frederic Weisbecker
  0 siblings, 1 reply; 6+ messages in thread
From: Tetsuo Handa @ 2013-04-23 12:45 UTC (permalink / raw)
  To: tglx, fweisbec; +Cc: linux-kernel, linux-fsdevel, fernando_b1

CONFIG_NO_HZ=y can cause idle/iowait values to decrease.

If /proc/stat is monitored at a short interval (e.g. 1 or 2 secs) using the
sysstat package, sar reports bogus %idle/iowait values because sar expects
that idle/iowait values do not decrease unless a wraparound happens.

This patch makes the idle/iowait values visible from /proc/stat increase
monotonically, on the assumption that we do not need to worry about
wraparound.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 fs/proc/stat.c |   42 ++++++++++++++++++++++++++++++++++++++----
 1 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..9fff534 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -19,6 +19,40 @@
 #define arch_irq_stat() 0
 #endif
 
+/*
+ * CONFIG_NO_HZ=y can cause idle/iowait values to decrease.
+ * Make sure that idle/iowait values visible from /proc/stat do not decrease.
+ */
+static inline u64 validate_iowait(u64 iowait, const int cpu)
+{
+#ifdef CONFIG_NO_HZ
+	static u64 max_iowait[NR_CPUS];
+	static DEFINE_SPINLOCK(lock);
+	spin_lock(&lock);
+	if (likely(iowait >= max_iowait[cpu]))
+		max_iowait[cpu] = iowait;
+	else
+		iowait = max_iowait[cpu];
+	spin_unlock(&lock);
+#endif
+	return iowait;
+}
+
+static inline u64 validate_idle(u64 idle, const int cpu)
+{
+#ifdef CONFIG_NO_HZ
+	static u64 max_idle[NR_CPUS];
+	static DEFINE_SPINLOCK(lock);
+	spin_lock(&lock);
+	if (likely(idle >= max_idle[cpu]))
+		max_idle[cpu] = idle;
+	else
+		idle = max_idle[cpu];
+	spin_unlock(&lock);
+#endif
+	return idle;
+}
+
 #ifdef arch_idle_time
 
 static cputime64_t get_idle_time(int cpu)
@@ -28,7 +62,7 @@ static cputime64_t get_idle_time(int cpu)
 	idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE];
 	if (cpu_online(cpu) && !nr_iowait_cpu(cpu))
 		idle += arch_idle_time(cpu);
-	return idle;
+	return validate_idle(idle, cpu);
 }
 
 static cputime64_t get_iowait_time(int cpu)
@@ -38,7 +72,7 @@ static cputime64_t get_iowait_time(int cpu)
 	iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT];
 	if (cpu_online(cpu) && nr_iowait_cpu(cpu))
 		iowait += arch_idle_time(cpu);
-	return iowait;
+	return validate_iowait(iowait, cpu);
 }
 
 #else
@@ -56,7 +90,7 @@ static u64 get_idle_time(int cpu)
 	else
 		idle = usecs_to_cputime64(idle_time);
 
-	return idle;
+	return validate_idle(idle, cpu);
 }
 
 static u64 get_iowait_time(int cpu)
@@ -72,7 +106,7 @@ static u64 get_iowait_time(int cpu)
 	else
 		iowait = usecs_to_cputime64(iowait_time);
 
-	return iowait;
+	return validate_iowait(iowait, cpu);
 }
 
 #endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread
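[Editorial note: the high-water-mark clamp the patch applies can be sketched in plain userspace C. The `NR_CPUS_DEMO` bound and the function name are illustrative, not kernel API, and the kernel's per-CPU locking is omitted; this only shows the monotonic-clamp idea.]

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_DEMO 4

/* Highest value already reported for each CPU. */
static uint64_t max_seen[NR_CPUS_DEMO];

/* Clamp a possibly-decreasing raw counter so readers see it
 * increase monotonically, as validate_idle()/validate_iowait() do. */
uint64_t clamp_monotonic(uint64_t raw, int cpu)
{
	if (raw >= max_seen[cpu])
		max_seen[cpu] = raw;	/* new high-water mark */
	else
		raw = max_seen[cpu];	/* hide the transient decrease */
	return raw;
}
```

For example, after reporting 10 for CPU 0, a racy raw reading of 8 would still be reported as 10 (`assert(clamp_monotonic(8, 0) == 10);`), while other CPUs are clamped independently.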
* Re: [PATCH] proc: Add workaround for idle/iowait decreasing problem.
  2013-04-23 12:45 ` [PATCH] proc: Add workaround for idle/iowait decreasing problem Tetsuo Handa
@ 2013-04-28  0:49   ` Frederic Weisbecker
  2013-07-02  3:56     ` Fernando Luis Vazquez Cao
  0 siblings, 1 reply; 6+ messages in thread
From: Frederic Weisbecker @ 2013-04-28  0:49 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: tglx, linux-kernel, linux-fsdevel, fernando_b1, Ingo Molnar,
	Peter Zijlstra, Andrew Morton

On Tue, Apr 23, 2013 at 09:45:23PM +0900, Tetsuo Handa wrote:
> CONFIG_NO_HZ=y can cause idle/iowait values to decrease.
>
> If /proc/stat is monitored at a short interval (e.g. 1 or 2 secs) using the
> sysstat package, sar reports bogus %idle/iowait values because sar expects
> that idle/iowait values do not decrease unless a wraparound happens.
>
> This patch makes the idle/iowait values visible from /proc/stat increase
> monotonically, on the assumption that we do not need to worry about
> wraparound.
>
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

It's not clear from the changelog why you see non-monotonic idle/iowait
values.

Looking at the previous patch from Fernando, it seems that's because we can
race with concurrent updates from the target CPU when it wakes up from idle?
(It could be updated by drivers/cpufreq/cpufreq_governor.c as well.)

If so, the bug has another symptom: we may also report a wrong iowait/idle
time by accounting the last idle time twice.

In that case we should fix the bug at the source; for example, we can force
the following ordering:

	= Write side =				= Read side =

	// tick_nohz_start_idle()
	write_seqcount_begin(ts->seq)
	ts->idle_entrytime = now
	ts->idle_active = 1
	write_seqcount_end(ts->seq)

	// tick_nohz_stop_idle()
	write_seqcount_begin(ts->seq)
	ts->iowait_sleeptime += now - ts->idle_entrytime
	ts->idle_active = 0
	write_seqcount_end(ts->seq)

					// get_cpu_iowait_time_us()
					do {
						seq = read_seqcount_begin(ts->seq)
						if (ts->idle_active) {
							time = now - ts->idle_entrytime
							time += ts->iowait_sleeptime
						} else {
							time = ts->iowait_sleeptime
						}
					} while (read_seqcount_retry(ts->seq, seq));

Right? A seqcount should be enough to make sure we are getting a consistent
result. I doubt we need harder locking.

Another thing while at it: it seems that an update done from
drivers/cpufreq/cpufreq_governor.c (calling get_cpu_iowait_time_us() ->
update_ts_time_stats()) can randomly race with a CPU entering/exiting idle.
I have no idea why drivers/cpufreq/cpufreq_governor.c does the update
itself; it could just compute the delta like any other reader. Maybe we
could remove that and only ever call update_ts_time_stats() from the CPU
that exits idle.

What do you think?

Thanks.

Frederic.
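[Editorial note: the seqcount ordering Frederic sketches above can be modeled in userspace C. The struct and helper names mirror the tick_sched fields in the diagram but are illustrative; a real seqcount also needs memory barriers and handles concurrent readers, which this single-threaded sketch omits.]

```c
#include <assert.h>
#include <stdint.h>

/* Minimal single-writer seqcount model of the ordering above. */
struct ts_demo {
	unsigned seq;			/* odd while a write is in progress */
	int idle_active;
	uint64_t idle_entrytime;
	uint64_t iowait_sleeptime;
};

static void write_begin(struct ts_demo *ts) { ts->seq++; }	/* seq becomes odd */
static void write_end(struct ts_demo *ts)   { ts->seq++; }	/* seq even again */

/* tick_nohz_start_idle() analogue */
void demo_start_idle(struct ts_demo *ts, uint64_t now)
{
	write_begin(ts);
	ts->idle_entrytime = now;
	ts->idle_active = 1;
	write_end(ts);
}

/* tick_nohz_stop_idle() analogue */
void demo_stop_idle(struct ts_demo *ts, uint64_t now)
{
	write_begin(ts);
	ts->iowait_sleeptime += now - ts->idle_entrytime;
	ts->idle_active = 0;
	write_end(ts);
}

/* get_cpu_iowait_time_us() analogue: retry until a stable snapshot,
 * adding the pending delta if the CPU is still idle. */
uint64_t demo_get_iowait(struct ts_demo *ts, uint64_t now)
{
	unsigned seq;
	uint64_t time;

	do {
		seq = ts->seq;
		if (ts->idle_active)
			time = ts->iowait_sleeptime + (now - ts->idle_entrytime);
		else
			time = ts->iowait_sleeptime;
	} while (ts->seq != seq || (seq & 1));	/* writer active or raced: retry */
	return time;
}
```

With this ordering a reader at `now = 5`, after `demo_start_idle(&ts, 3)`, sees the pending delta (2) without the writer ever publishing a half-updated pair.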
* Re: [PATCH] proc: Add workaround for idle/iowait decreasing problem.
  2013-04-28  0:49 ` Frederic Weisbecker
@ 2013-07-02  3:56   ` Fernando Luis Vazquez Cao
  2013-07-02 10:39     ` Fernando Luis Vazquez Cao
  2013-08-07  0:12     ` Frederic Weisbecker
  0 siblings, 2 replies; 6+ messages in thread
From: Fernando Luis Vazquez Cao @ 2013-07-02  3:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Tetsuo Handa, tglx, linux-kernel, linux-fsdevel, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven

Hi Frederic,

I'm sorry it's taken me so long to respond; I got sidetracked for
a while. Comments follow below.

On 2013/04/28 09:49, Frederic Weisbecker wrote:
> On Tue, Apr 23, 2013 at 09:45:23PM +0900, Tetsuo Handa wrote:
>> CONFIG_NO_HZ=y can cause idle/iowait values to decrease.
[...]
> It's not clear from the changelog why you see non-monotonic idle/iowait
> values.
>
> Looking at the previous patch from Fernando, it seems that's because we can
> race with concurrent updates from the target CPU when it wakes up from idle?
> (It could be updated by drivers/cpufreq/cpufreq_governor.c as well.)
>
> If so, the bug has another symptom: we may also report a wrong iowait/idle
> time by accounting the last idle time twice.
>
> In that case we should fix the bug at the source; for example, we can force
> the following ordering:
[...]
> Right? A seqcount should be enough to make sure we are getting a consistent
> result. I doubt we need harder locking.

I tried that and it doesn't suffice. The problem that causes the most
serious skews is related to the CPU scheduler: the per-run-queue
counter nr_iowait can be updated not only from the CPU it belongs
to but also from any other CPU if tasks are migrated out while
waiting on I/O.

The race looks like this:

CPU0					CPU1
					[ CPU1_rq->nr_iowait == 0 ]
					Task foo: io_schedule()
						  schedule()
					[ CPU1_rq->nr_iowait == 1 ]
					Task foo migrated to CPU0
					Goes to sleep

// get_cpu_iowait_time_us(1, NULL)
[ CPU1_ts->idle_active == 1, CPU1_rq->nr_iowait == 1 ]
[ CPU1_ts->iowait_sleeptime = 4, CPU1_ts->idle_entrytime = 3 ]
now = 5
delta = 5 - 3 = 2
iowait = 4 + 2 = 6

Task foo wakes up
[ CPU1_rq->nr_iowait == 0 ]

					CPU1 comes out of sleep state
					tick_nohz_stop_idle()
					  update_ts_time_stats()
					  [ CPU1_ts->idle_active == 1, CPU1_rq->nr_iowait == 0 ]
					  [ CPU1_ts->iowait_sleeptime = 4, CPU1_ts->idle_entrytime = 3 ]
					  now = 6
					  delta = 6 - 3 = 3
					  (CPU1_ts->iowait_sleeptime is not updated)
					  CPU1_ts->idle_entrytime = now = 6
					  CPU1_ts->idle_active = 0

// get_cpu_iowait_time_us(1, NULL)
[ CPU1_ts->idle_active == 0, CPU1_rq->nr_iowait == 0 ]
[ CPU1_ts->iowait_sleeptime = 4, CPU1_ts->idle_entrytime = 6 ]
iowait = CPU1_ts->iowait_sleeptime = 4
(iowait decreased from 6 to 4)

> Another thing while at it: it seems that an update done from
> drivers/cpufreq/cpufreq_governor.c (calling get_cpu_iowait_time_us() ->
> update_ts_time_stats()) can randomly race with a CPU entering/exiting idle.
> I have no idea why drivers/cpufreq/cpufreq_governor.c does the update
> itself; it could just compute the delta like any other reader. Maybe we
> could remove that and only ever call update_ts_time_stats() from the CPU
> that exits idle.
>
> What do you think?

I am all for it. We just need to make sure that CPU governors
can cope with non-monotonic idle and iowait times. I'll take
a closer look at the code but I wouldn't mind if Arjan (CCed)
beat me to it.

Thanks,
Fernando
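[Editorial note: the skew in the trace above can be replayed with a small userspace model. The struct and function names are illustrative, not kernel code, but the arithmetic follows the trace exactly and shows the reported value dropping from 6 to 4.]

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of one CPU's nohz idle bookkeeping. */
struct cpu_model {
	int idle_active;
	int nr_iowait;			/* rq->nr_iowait analogue */
	uint64_t idle_entrytime;
	uint64_t iowait_sleeptime;
};

/* Reader, like get_cpu_iowait_time_us(): adds the pending delta
 * while the CPU is idle with a task waiting on I/O. */
uint64_t model_read_iowait(const struct cpu_model *c, uint64_t now)
{
	if (c->idle_active && c->nr_iowait)
		return c->iowait_sleeptime + (now - c->idle_entrytime);
	return c->iowait_sleeptime;
}

/* Idle exit, like update_ts_time_stats(): the pending delta is folded
 * into iowait_sleeptime only if nr_iowait is still non-zero -- which it
 * no longer is once the waiting task was migrated away and woken up. */
void model_exit_idle(struct cpu_model *c, uint64_t now)
{
	if (c->nr_iowait)
		c->iowait_sleeptime += now - c->idle_entrytime;
	c->idle_entrytime = now;
	c->idle_active = 0;
}
```

Replaying the trace: start from `{1, 1, 3, 4}` (idle since t=3, 4 units banked), read at now=5 to get 6, clear `nr_iowait` for the migration wakeup, exit idle at now=6, and a second read returns only 4.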
* Re: [PATCH] proc: Add workaround for idle/iowait decreasing problem.
  2013-07-02  3:56 ` Fernando Luis Vazquez Cao
@ 2013-07-02 10:39   ` Fernando Luis Vazquez Cao
  2013-08-07  0:58     ` Frederic Weisbecker
  0 siblings, 1 reply; 6+ messages in thread
From: Fernando Luis Vazquez Cao @ 2013-07-02 10:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Tetsuo Handa, tglx, linux-kernel, linux-fsdevel, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven

On 2013/07/02 12:56, Fernando Luis Vazquez Cao wrote:
> Hi Frederic,
>
> I'm sorry it's taken me so long to respond; I got sidetracked for
> a while. Comments follow below.
>
> On 2013/04/28 09:49, Frederic Weisbecker wrote:
>> On Tue, Apr 23, 2013 at 09:45:23PM +0900, Tetsuo Handa wrote:
>>> CONFIG_NO_HZ=y can cause idle/iowait values to decrease.
> [...]
>> Right? A seqcount should be enough to make sure we are getting a
>> consistent result. I doubt we need harder locking.
>
> I tried that and it doesn't suffice. The problem that causes the most
> serious skews is related to the CPU scheduler: the per-run-queue
> counter nr_iowait can be updated not only from the CPU it belongs
> to but also from any other CPU if tasks are migrated out while
> waiting on I/O.
>
> The race looks like this:
[...]
> iowait = CPU1_ts->iowait_sleeptime = 4
> (iowait decreased from 6 to 4)

A possible solution to the races above would be to add a per-CPU
variable such as ->iowait_sleeptime_user, which shadows
->iowait_sleeptime but is maintained in get_cpu_iowait_time_us()
and kept monotonic; the shadow would be the value we export to
user space.

Another approach would be updating ->nr_iowait of the source and
destination CPUs during task migration, but this may be overkill.

What do you think?

Thanks,
Fernando
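[Editorial note: Fernando's first suggestion -- a monotonic shadow counter maintained at read time -- might look roughly like this. The `_user` suffix is taken from his description; the struct and function names are otherwise an illustrative userspace sketch, not a proposed kernel implementation.]

```c
#include <assert.h>
#include <stdint.h>

struct iowait_model {
	uint64_t iowait_sleeptime;	/* raw value, may transiently go backwards */
	uint64_t iowait_sleeptime_user;	/* monotonic shadow shown to user space */
};

/* Read-side clamp: record the latest raw reading, but only ever move
 * the exported shadow forward, so user space never sees a decrease. */
uint64_t read_iowait_user(struct iowait_model *m, uint64_t raw_now)
{
	m->iowait_sleeptime = raw_now;
	if (m->iowait_sleeptime > m->iowait_sleeptime_user)
		m->iowait_sleeptime_user = m->iowait_sleeptime;
	return m->iowait_sleeptime_user;
}
```

The trade-off, as with the original patch, is that the shadow papers over the race rather than fixing it: the raw counter can still lose a delta, but the exported value stays monotonic.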
* Re: [PATCH] proc: Add workaround for idle/iowait decreasing problem.
  2013-07-02 10:39 ` Fernando Luis Vazquez Cao
@ 2013-08-07  0:58   ` Frederic Weisbecker
  0 siblings, 0 replies; 6+ messages in thread
From: Frederic Weisbecker @ 2013-08-07  0:58 UTC (permalink / raw)
  To: Fernando Luis Vazquez Cao
  Cc: Tetsuo Handa, tglx, linux-kernel, linux-fsdevel, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Tue, Jul 02, 2013 at 07:39:08PM +0900, Fernando Luis Vazquez Cao wrote:
> On 2013/07/02 12:56, Fernando Luis Vazquez Cao wrote:
[...]
> > I tried that and it doesn't suffice. The problem that causes the most
> > serious skews is related to the CPU scheduler: the per-run-queue
> > counter nr_iowait can be updated not only from the CPU it belongs
> > to but also from any other CPU if tasks are migrated out while
> > waiting on I/O.
[...]
> A possible solution to the races above would be to add a per-CPU
> variable such as ->iowait_sleeptime_user, which shadows
> ->iowait_sleeptime but is maintained in get_cpu_iowait_time_us()
> and kept monotonic; the shadow would be the value we export to
> user space.
>
> Another approach would be updating ->nr_iowait of the source and
> destination CPUs during task migration, but this may be overkill.
>
> What do you think?

I have the feeling we can fix that if we:

* only update ts->idle_sleeptime / ts->iowait_sleeptime locally, from
  tick_nohz_start_idle() and tick_nohz_stop_idle();

* let readers add the pending delta to these values whenever they
  fetch them;

* use a seqcount to ensure that the ts->idle_entrytime and
  ts->iowait/idle_sleeptime update sequences are well synchronized.

I just wrote the patches that do that. Let me test them and write the
changelogs, then I'll post them tomorrow.

Thanks.
* Re: [PATCH] proc: Add workaround for idle/iowait decreasing problem.
  2013-07-02  3:56 ` Fernando Luis Vazquez Cao
  2013-07-02 10:39   ` Fernando Luis Vazquez Cao
@ 2013-08-07  0:12   ` Frederic Weisbecker
  1 sibling, 0 replies; 6+ messages in thread
From: Frederic Weisbecker @ 2013-08-07  0:12 UTC (permalink / raw)
  To: Fernando Luis Vazquez Cao
  Cc: Tetsuo Handa, tglx, linux-kernel, linux-fsdevel, Ingo Molnar,
	Peter Zijlstra, Andrew Morton, Arjan van de Ven

On Tue, Jul 02, 2013 at 12:56:04PM +0900, Fernando Luis Vazquez Cao wrote:
> Hi Frederic,
>
> I'm sorry it's taken me so long to respond; I got sidetracked for
> a while. Comments follow below.
[...]
> I tried that and it doesn't suffice. The problem that causes the most
> serious skews is related to the CPU scheduler: the per-run-queue
> counter nr_iowait can be updated not only from the CPU it belongs
> to but also from any other CPU if tasks are migrated out while
> waiting on I/O.
>
> The race looks like this:
[...]
> iowait = CPU1_ts->iowait_sleeptime = 4
> (iowait decreased from 6 to 4)

Yeah, that's why we need to allow updates of ts->idle/iowait_sleeptime
only from the local CPU when it exits idle.

> > Another thing while at it: it seems that an update done from
> > drivers/cpufreq/cpufreq_governor.c (calling get_cpu_iowait_time_us() ->
> > update_ts_time_stats()) can randomly race with a CPU entering/exiting
> > idle. I have no idea why drivers/cpufreq/cpufreq_governor.c does the
> > update itself; it could just compute the delta like any other reader.
> > Maybe we could remove that and only ever call update_ts_time_stats()
> > from the CPU that exits idle.
> >
> > What do you think?
>
> I am all for it. We just need to make sure that CPU governors
> can cope with non-monotonic idle and iowait times. I'll take
> a closer look at the code but I wouldn't mind if Arjan (CCed)
> beat me to it.

I'm not sure what you mean: only allowing the update from local idle
exit won't break monotonicity. I'll try to write some patches for that.