From: Frederic Weisbecker <fweisbec@gmail.com>
To: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
tglx@linutronix.de, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, Ingo Molnar <mingo@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
Arjan van de Ven <arjan@linux.intel.com>
Subject: Re: [PATCH] proc: Add workaround for idle/iowait decreasing problem.
Date: Wed, 7 Aug 2013 02:58:39 +0200
Message-ID: <20130807005838.GB3011@somewhere>
In-Reply-To: <51D2ADCC.1090807@lab.ntt.co.jp>
On Tue, Jul 02, 2013 at 07:39:08PM +0900, Fernando Luis Vazquez Cao wrote:
> On 2013/07/02 12:56, Fernando Luis Vazquez Cao wrote:
> >Hi Frederic,
> >
> >I'm sorry it's taken me so long to respond; I got sidetracked for
> >a while. Comments follow below.
> >
> >On 2013/04/28 09:49, Frederic Weisbecker wrote:
> >>On Tue, Apr 23, 2013 at 09:45:23PM +0900, Tetsuo Handa wrote:
> >>>CONFIG_NO_HZ=y can cause idle/iowait values to decrease.
> >[...]
> >>It's not clear in the changelog why you see non-monotonic
> >>idle/iowait values.
> >>
> >>Looking at the previous patch from Fernando, it seems that's because we
> >>can race with concurrent updates from the CPU target when it wakes up
> >>from idle?
> >>(could be updated by drivers/cpufreq/cpufreq_governor.c as well).
> >>
> >>If so the bug has another symptom: we may also report a wrong iowait/idle
> >>time by accounting the last idle time twice.
> >>
> >>In this case we should fix the bug from the source, for example we can
> >>force the given ordering:
> >>
> >>= Write side =                              = Read side =
> >>
> >>// tick_nohz_start_idle()
> >>write_seqcount_begin(ts->seq)
> >>ts->idle_entrytime = now
> >>ts->idle_active = 1
> >>write_seqcount_end(ts->seq)
> >>
> >>// tick_nohz_stop_idle()
> >>write_seqcount_begin(ts->seq)
> >>ts->iowait_sleeptime += now - ts->idle_entrytime
> >>ts->idle_active = 0
> >>write_seqcount_end(ts->seq)
> >>
> >>                                            // get_cpu_iowait_time_us()
> >>                                            do {
> >>                                                seq = read_seqcount_begin(ts->seq)
> >>                                                if (ts->idle_active) {
> >>                                                    time = now - ts->idle_entrytime
> >>                                                    time += ts->iowait_sleeptime
> >>                                                } else {
> >>                                                    time = ts->iowait_sleeptime
> >>                                                }
> >>                                            } while (read_seqcount_retry(ts->seq, seq));
> >>
> >>Right? seqcount should be enough to make sure we are getting a
> >>consistent result.
> >>I doubt we need harder locking.
> >
> >I tried that and it doesn't suffice. The problem that causes the most
> >serious skews is related to the CPU scheduler: the per-run queue
> >counter nr_iowait can be updated not only from the CPU it belongs
> >to but also from any other CPU if tasks are migrated out while
> >waiting on I/O.
> >
> >The race looks like this:
> >
> >CPU0                                        CPU1
> >                                            [ CPU1_rq->nr_iowait == 0 ]
> >                                            Task foo: io_schedule()
> >                                            schedule()
> >                                            [ CPU1_rq->nr_iowait == 1 ]
> >                                            Task foo migrated to CPU0
> >                                            Goes to sleep
> >
> >// get_cpu_iowait_time_us(1, NULL)
> >[ CPU1_ts->idle_active == 1, CPU1_rq->nr_iowait == 1 ]
> >[ CPU1_ts->iowait_sleeptime = 4, CPU1_ts->idle_entrytime = 3 ]
> >now = 5
> >delta = 5 - 3 = 2
> >iowait = 4 + 2 = 6
> >
> >Task foo wakes up
> >[ CPU1_rq->nr_iowait == 0 ]
> >
> >                                            CPU1 comes out of sleep state
> >                                            tick_nohz_stop_idle()
> >                                            update_ts_time_stats()
> >                                            [ CPU1_ts->idle_active == 1, CPU1_rq->nr_iowait == 0 ]
> >                                            [ CPU1_ts->iowait_sleeptime = 4, CPU1_ts->idle_entrytime = 3 ]
> >                                            now = 6
> >                                            delta = 6 - 3 = 3
> >                                            (CPU1_ts->iowait_sleeptime is not updated)
> >                                            CPU1_ts->idle_entrytime = now = 6
> >                                            CPU1_ts->idle_active = 0
> >
> >// get_cpu_iowait_time_us(1, NULL)
> >[ CPU1_ts->idle_active == 0, CPU1_rq->nr_iowait == 0 ]
> >[ CPU1_ts->iowait_sleeptime = 4, CPU1_ts->idle_entrytime = 6 ]
> >iowait = CPU1_ts->iowait_sleeptime = 4
> >(iowait decreased from 6 to 4)
>
> A possible solution to the races above would be to add
> a per-cpu variable such as ->iowait_sleeptime_user, which
> shadows ->iowait_sleeptime but is maintained in
> get_cpu_iowait_time_us() and kept monotonic. The shadow
> variable would be the one we export to user space.
>
> Another approach would be updating ->nr_iowait
> of the source and destination CPUs during task
> migration, but this may be overkill.
>
> What do you think?
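
For reference, the first proposal quoted above (a monotonic shadow of ->iowait_sleeptime) can be sketched in user-space C as follows. This is only an illustration under assumptions: struct ts_sketch and get_iowait_user() are hypothetical names, nr_iowait is passed in as a plain argument, and no locking is shown, so concurrent readers would need their own synchronization. It hides the backwards jump from readers rather than fixing the underlying accounting.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-CPU tick_sched state. */
struct ts_sketch {
	int      idle_active;           /* CPU is currently idle */
	uint64_t idle_entrytime;        /* start of the current idle period */
	uint64_t iowait_sleeptime;      /* iowait time accumulated by the CPU */
	uint64_t iowait_sleeptime_user; /* shadow value, never decreases */
};

/*
 * Compute the raw iowait value the same way the quoted trace does, then
 * clamp it so the value handed to user space never goes backwards.
 * nr_iowait is an argument because other CPUs may change it remotely.
 */
static uint64_t get_iowait_user(struct ts_sketch *ts, uint64_t now,
				unsigned int nr_iowait)
{
	uint64_t raw = ts->iowait_sleeptime;

	assert(now >= ts->idle_entrytime);
	if (ts->idle_active && nr_iowait > 0)
		raw += now - ts->idle_entrytime;

	/* Keep the exported value monotonic. */
	if (raw > ts->iowait_sleeptime_user)
		ts->iowait_sleeptime_user = raw;
	return ts->iowait_sleeptime_user;
}
```

With the numbers from the trace (iowait_sleeptime = 4, idle_entrytime = 3), a read at now = 5 with nr_iowait == 1 yields 6, and a later read after CPU1 has left idle with nr_iowait == 0 still yields 6 instead of dropping to 4.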
I have the feeling we can fix that with:

* only update ts->idle_sleeptime / ts->iowait_sleeptime locally
  from tick_nohz_start_idle() and tick_nohz_stop_idle()
* readers can add the pending delta to these values anytime they fetch them
* use seqcount to ensure that ts->idle_entrytime, ts->iowait/idle_sleeptime update
  sequences are well synchronized.
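
A rough user-space sketch of the three points above, not the actual patches: the accumulated time is written only by the owning CPU on idle entry and exit, readers add the pending delta themselves, and a sequence counter lets readers detect a concurrent update and retry. All names here (struct ts_sketch, start_idle(), stop_idle(), get_sleeptime()) are hypothetical, a single sleeptime field stands in for both idle_sleeptime and iowait_sleeptime, and C11 sequentially consistent atomics replace the kernel's seqcount API. How nr_iowait is sampled is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct tick_sched; not kernel code. */
struct ts_sketch {
	atomic_uint      seq;            /* odd while an update is in progress */
	atomic_int       idle_active;
	_Atomic uint64_t idle_entrytime;
	_Atomic uint64_t sleeptime;      /* stands in for idle/iowait_sleeptime */
};

/* Writer side: called only by the CPU that owns ts. */
static void start_idle(struct ts_sketch *ts, uint64_t now)
{
	atomic_fetch_add(&ts->seq, 1);   /* begin: seq becomes odd */
	atomic_store(&ts->idle_entrytime, now);
	atomic_store(&ts->idle_active, 1);
	atomic_fetch_add(&ts->seq, 1);   /* end: seq becomes even */
}

static void stop_idle(struct ts_sketch *ts, uint64_t now)
{
	assert(atomic_load(&ts->idle_active));
	atomic_fetch_add(&ts->seq, 1);
	atomic_fetch_add(&ts->sleeptime, now - atomic_load(&ts->idle_entrytime));
	atomic_store(&ts->idle_active, 0);
	atomic_fetch_add(&ts->seq, 1);
}

/* Reader side: never writes ts, only adds the pending delta locally. */
static uint64_t get_sleeptime(struct ts_sketch *ts, uint64_t now)
{
	unsigned int seq;
	uint64_t time, entry;

	do {
		seq = atomic_load(&ts->seq);
		time = atomic_load(&ts->sleeptime);
		entry = atomic_load(&ts->idle_entrytime);
		/* Skip the delta if our timestamp predates the idle entry. */
		if (atomic_load(&ts->idle_active) && now > entry)
			time += now - entry;
		/* Retry if a writer was active or finished in between. */
	} while ((seq & 1) || seq != atomic_load(&ts->seq));

	return time;
}
```

Because readers never store into ts, a remote read cannot be undone or double-counted by the owning CPU's later update, which is the property the list above is after.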

I just wrote the patches that do that. Let me just test them and write the
changelogs, then I'll post them tomorrow.

Thanks.