From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
To: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@ozlabs.org
Subject: Re: [PATCH 1/2] tickless idle cpus: core patch - v2
Date: Mon, 24 Apr 2006 21:09:53 +0530 [thread overview]
Message-ID: <20060424153953.GA4934@in.ibm.com> (raw)
In-Reply-To: <17480.47308.126194.26202@cargo.ozlabs.ibm.com>
Paul,
At the outset I am slightly confused on whether you are
proposing to allow boot cpus to skip ticks or not. A clarification on
this will help!
On Fri, Apr 21, 2006 at 08:49:48PM +1000, Paul Mackerras wrote:
> Is the nohz_cpu_mask accessed by other cpus? I wonder if using a
Yes, nohz_cpu_mask can be accessed by other CPUs, as part of RCU batch
processing.
> 1-byte per-cpu variable for this, or even a bit in the thread_info
> struct, might be better because it would reduce the number of atomic
> bit set/clear operations we have to do.
Agreed that updating a global bitmask is not a scalable thing to do. But
unfortunately this is tied to RCU implementation ATM. I know that Dipankar has
plans to get rid of nohz_cpu_mask in the RCU implementation. When that happens,
maybe we can revist this.
> > +#define MAX_DEC_COUNT UINT_MAX /* Decrementer is 32-bit */
>
> The decrementer value should be restricted to INT_MAX. I think some
> implementations will take a decrementer interrupt if you set the
> decrementer to a negative value.
Ok. Thanks for pointing it out.
> > +static void account_ticks(struct pt_regs *regs)
> > +{
> > + int next_dec;
> > + int cpu = smp_processor_id();
> > + unsigned long ticks;
> > +
> > while ((ticks = tb_ticks_since(per_cpu(last_jiffy, cpu)))
> > >= tb_ticks_per_jiffy) {
> > /* Update last_jiffy */
>
> This is just the loop body from timer_interrupt, right? Why do we
Correct. Note that this loop body is common to both timer_interrupt and
start_hz_timer currently.
> have to loop around N times after we skipped N ticks? What we're
> doing inside the loop is:
>
> - account_process_time, but we know we were in the idle task, so the
> time is just idle time. If we have the accurate accounting stuff
> turned on the first call to account_process_time will account all
> the idle time anyway.
Do you mean to say that "the first call to account_system_vtime will account
all the idle time"? AFAICS account_process_time (->account_process_vtime) is
accounting just usertime.
I had a related question: if we put the idle CPU to some power-saving
state in the period that it is tickless, would its PURR value keep
getting incremented? If not, would that affect the idle time we
calculate after skipping N ticks (in case accurate acct is turned ON
that is)?
> - we're not skipping ticks on the boot cpu (yet), so we won't do the
> do_timer and timer_recalc_offset calls.
Don't follow you here. I thought we wanted to let the boot cpu to skip
ticks too (as you seem to indicate down below - while talking of
updating xtime/NTP)? If we want to allow that, then we need to be able to
call do_timer with a regs argument?
> I think we could probably rearrange this code to eliminate the need
> for the regs argument - all it's used for is telling whether we were
> in user mode, and we know we weren't since we were in the idle task.
> That would mean that we maybe could call start_hz_timer from the idle
> task body instead of inside do_IRQ etc. The only thing we'd have to
> watch out for is updating the decrementer if some interrupt handler
> called add_timer/mod_timer etc.
One problem in deferring the call to start_hz_timer like this is RCU. Ideally
we want to clear the tickless idle-CPU from nohz_cpu_mask -as soon- as it
starts processing an interrupt. Otherwise RCU will think that the CPU is in
tickless state (hence not accessing RCU-protected data structures) while the
idle-CPU is processing interrupts/softirqs and will not include the CPU in its
grace-period processing. This may be a bad thing to allow.
The other problem in deferring a call to start_hz_timer is we could cause them
to be running with an incorrect jiffy value (can happen if all cpus were
skipping ticks for a period).
IMO we should let start_hz_timer be called as it is being called now, i.e
immediately when a tickless idle-CPU wakes up. This would also
unfortunately mean that we need to embed calls to start_hz_timer from all
the interrupt paths (do_IRQ, performance_monitor_exception etc) where we
come out of tickless state.
> Assuming we make the changes we have discussed to reduce the skewing
> of the decrementer interrupts quite small, and allow any cpu to call
> do_timer, then how where you thinking that xtime and the NTP stuff
> would get updated, in the situation where all cpus are skipping ticks?
> By doing N calls to do_timer in a row? Or by adding N-1 to jiffies_64
> and then calling do_timer once?
I assume this implies we want to let boot-cpu to skip ticks?
Ideally it would be good if we add N-1 to jiffies_64 and call do_timer
once. But I wonder if it can be done w/o being racy. In order to calculate
N, we may calculate the difference, delta, between current timebase and
tb_last_stamp as:
delta = tb_ticks_since(tb_last_stamp); -> Step 1
We may then calculate N as:
N = delta / tb_ticks_per_jiffy; -> Step 2
However this will be racy, if we happen to sample timebase (Step 1) just
before a jiffy boundary. In which case, we would have missed to update
jiffy by 1? To avoid this, we may need to check for the corner case and
introduce a 'goto retry' loop?
BTW, I realized that my patch to let boot cpu to skip ticks also is
slightly buggy here:
if (cpu != do_timer_cpu)
continue;
write_seqlock(&xtime_lock);
...
do_timer(regs);
...
write_sequnlock(&xtime_lock);
There needs to a different while loop to call do_timer based on
tb_ticks_since(tb_last_stamp). Will fix it next time around.
>
> > +#ifdef CONFIG_NO_IDLE_HZ
> > + max_skip = __USE_RTC() ? HZ : MAX_DEC_COUNT / tb_ticks_per_jiffy;
> > +#endif
>
> If we allow the boot cpu to skip ticks, then we will have to make sure
> that at least one cpu wakes up in time to do the updating in
> recalc_timer_offset.
Good point. However is this critical, especially if all cpus had been
idle for quite a while? As I understand, if we don't wakeup in time to
do timer_recalc_offset, the only drawback is the first gettimeofday will
be slow (because it has make a system call)?
--
Regards,
vatsa
next prev parent reply other threads:[~2006-04-24 15:41 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-04-07 6:30 [PATCH 1/4] tickless idle cpu - Allow any CPU to update jiffies Srivatsa Vaddagiri
2006-04-07 23:04 ` Paul Mackerras
2006-04-10 11:49 ` Srivatsa Vaddagiri
2006-04-10 12:18 ` [PATCH 1/2] tickless idle cpus: core patch - v2 Srivatsa Vaddagiri
2006-04-11 17:35 ` Paul Mackerras
2006-04-12 4:50 ` Srivatsa Vaddagiri
2006-04-21 10:49 ` Paul Mackerras
2006-04-24 15:39 ` Srivatsa Vaddagiri [this message]
2006-04-10 12:19 ` [PATCH 2/2] tickless idle cpus: allow boot cpu to skip ticks Srivatsa Vaddagiri
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20060424153953.GA4934@in.ibm.com \
--to=vatsa@in.ibm.com \
--cc=linuxppc-dev@ozlabs.org \
--cc=paulus@samba.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).