From mboxrd@z Thu Jan 1 00:00:00 1970 From: Fernando Luis Vazquez Cao Subject: Re: [PATCH] proc: Add workaround for idle/iowait decreasing problem. Date: Tue, 02 Jul 2013 19:39:08 +0900 Message-ID: <51D2ADCC.1090807@lab.ntt.co.jp> References: <201301152014.AAD52192.FOOHQVtSFMFOJL@I-love.SAKURA.ne.jp> <201301180857.r0I8vK7c052791@www262.sakura.ne.jp> <1363660703.4993.3.camel@nexus> <201304012205.DFC60784.HVOMJSFFLFtOOQ@I-love.SAKURA.ne.jp> <201304232145.AHE52181.HJVOOQSFLMFOtF@I-love.SAKURA.ne.jp> <20130428004940.GA10354@somewhere> <51D24F54.1000703@lab.ntt.co.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Tetsuo Handa , tglx@linutronix.de, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Ingo Molnar , Peter Zijlstra , Andrew Morton , Arjan van de Ven To: Frederic Weisbecker Return-path: In-Reply-To: <51D24F54.1000703@lab.ntt.co.jp> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org On 2013=E5=B9=B407=E6=9C=8802=E6=97=A5 12:56, Fernando Luis Vazquez Cao= wrote: > Hi Frederic, > > I'm sorry it's taken me so long to respond; I got sidetracked for > a while. Comments follow below. > > On 2013/04/28 09:49, Frederic Weisbecker wrote: >> On Tue, Apr 23, 2013 at 09:45:23PM +0900, Tetsuo Handa wrote: >>> CONFIG_NO_HZ=3Dy can cause idle/iowait values to decrease. > [...] >> It's not clear in the changelog why you see non-monotonic idle/iowai= t=20 >> values. >> >> Looking at the previous patch from Fernando, it seems that's because= =20 >> we can >> race with concurrent updates from the CPU target when it wakes up=20 >> from idle? >> (could be updated by drivers/cpufreq/cpufreq_governor.c as well). >> >> If so the bug has another symptom: we may also report a wrong=20 >> iowait/idle time >> by accounting the last idle time twice. >> >> In this case we should fix the bug from the source, for example we=20 >> can force >> the given ordering: >> >> =3D Write side =3D =3D Read side =3D >> >> // tick_nohz_start_idle() >> write_seqcount_begin(ts->seq) >> ts->idle_entrytime =3D now >> ts->idle_active =3D 1 >> write_seqcount_end(ts->seq) >> >> // tick_nohz_stop_idle() >> write_seqcount_begin(ts->seq) >> ts->iowait_sleeptime +=3D now - ts->idle_entrytime >> t->idle_active =3D 0 >> write_seqcount_end(ts->seq) >> >> // get_cpu_iowait_time_us() >> do { >> seq =3D=20 >> read_seqcount_begin(ts->seq) >> if (t->idle_active) { >> time =3D now -=20 >> ts->idle_entrytime >> time +=3D=20 >> ts->iowait_sleeptime >> } else { >> time =3D=20 >> ts->iowait_sleeptime >> } >> } while=20 >> (read_seqcount_retry(ts->seq, seq)); >> >> Right? seqcount should be enough to make sure we are getting a=20 >> consistent result. >> I doubt we need harder locking. > > I tried that and it doesn't suffice. The problem that causes the most > serious skews is related to the CPU scheduler: the per-run queue > counter nr_iowait can be updated not only from the CPU it belongs > to but also from any other CPU if tasks are migrated out while > waiting on I/O. > > The race looks like this: > > CPU0 CPU1 > [ CPU1_rq->nr_iowait =3D=3D 0 ] > Task foo: io_schedule() > schedule() > [ CPU1_rq->nr_iowait =3D=3D 1) ] > Task foo migrated to CPU0 > Goes to sleep > > // get_cpu_iowait_time_us(1, NULL) > [ CPU1_ts->idle_active =3D=3D 1, CPU1_rq->nr_iowait =3D=3D 1 ] > [ CPU1_ts->iowait_sleeptime =3D 4, CPU1_ts->idle_entrytime =3D 3 ] > now =3D 5 > delta =3D 5 - 3 =3D 2 > iowait =3D 4 + 2 =3D 6 > > Task foo wakes up > [ CPU1_rq->nr_iowait =3D=3D 0 ] > > CPU1 comes out of sleep state > tick_nohz_stop_idle() > update_ts_time_stats() > [ CPU1_ts->idle_active =3D=3D 1,=20 > CPU1_rq->nr_iowait =3D=3D 0 ] > [ CPU1_ts->iowait_sleeptime =3D 4= ,=20 > CPU1_ts->idle_entrytime =3D 3 ] > now =3D 6 > delta =3D 6 - 3 =3D 3 > (CPU1_ts->iowait_sleeptime is not= =20 > updated) > CPU1_ts->idle_entrytime =3D now =3D= 6 > CPU1_ts->idle_active =3D 0 > > // get_cpu_iowait_time_us(1, NULL) > [ CPU1_ts->idle_active =3D=3D 0, CPU1_rq->nr_iowait =3D=3D 0 ] > [ CPU1_ts->iowait_sleeptime =3D 4, CPU1_ts->idle_entrytime =3D 6 ] > iowait =3D CPU1_ts->iowait_sleeptime =3D 4 > (iowait decreased from 6 to 4) A possible solution to the races above would be to add a per-cpu variable such ->iowait_sleeptime_user which shadows ->iowait_sleeptime but is maintained in get_cpu_iowait_time_us() and kept monotonic, the former being the one we would export to user space. Another approach would be updating ->nr_iowait of the source and destination CPUs during task migration, but this may be overkill. What do you think? Thanks, =46ernando