All of lore.kernel.org
 help / color / mirror / Atom feed
From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
To: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@ozlabs.org
Subject: Re: [PATCH 1/2] tickless idle cpus: core patch - v2
Date: Mon, 24 Apr 2006 21:09:53 +0530	[thread overview]
Message-ID: <20060424153953.GA4934@in.ibm.com> (raw)
In-Reply-To: <17480.47308.126194.26202@cargo.ozlabs.ibm.com>

Paul,
	At the outset I am slightly confused on whether you are
proposing to allow boot cpus to skip ticks or not. A clarification on
this will help!

On Fri, Apr 21, 2006 at 08:49:48PM +1000, Paul Mackerras wrote:
> Is the nohz_cpu_mask accessed by other cpus?  I wonder if using a

Yes, nohz_cpu_mask can be accessed by other CPUs, as part of RCU batch
processing.

> 1-byte per-cpu variable for this, or even a bit in the thread_info
> struct, might be better because it would reduce the number of atomic
> bit set/clear operations we have to do.

Agreed that updating a global bitmask is not a scalable thing to do. But
unfortunately this is tied to RCU implementation ATM. I know that Dipankar has 
plans to get rid of nohz_cpu_mask in the RCU implementation. When that happens, 
maybe we can revist this.

> > +#define MAX_DEC_COUNT	UINT_MAX	/* Decrementer is 32-bit */
> 
> The decrementer value should be restricted to INT_MAX.  I think some
> implementations will take a decrementer interrupt if you set the
> decrementer to a negative value.

Ok. Thanks for pointing it out.

> > +static void account_ticks(struct pt_regs *regs)
> > +{
> > +	int next_dec;
> > +	int cpu = smp_processor_id();
> > +	unsigned long ticks;
> > +
> >  	while ((ticks = tb_ticks_since(per_cpu(last_jiffy, cpu)))
> >  	       >= tb_ticks_per_jiffy) {
> >  		/* Update last_jiffy */
> 
> This is just the loop body from timer_interrupt, right?  Why do we

Correct. Note that this loop body is common to both timer_interrupt and 
start_hz_timer currently.

> have to loop around N times after we skipped N ticks?  What we're
> doing inside the loop is:
> 
> - account_process_time, but we know we were in the idle task, so the
>   time is just idle time.  If we have the accurate accounting stuff
>   turned on the first call to account_process_time will account all
>   the idle time anyway.

Do you mean to say that "the first call to account_system_vtime will account 
all the idle time"? AFAICS account_process_time (->account_process_vtime) is
accounting just usertime.

I had a related question: if we put the idle CPU to some power-saving
state in the period that it is tickless, would its PURR value keep
getting incremented? If not, would that affect the idle time we
calculate after skipping N ticks (in case accurate acct is turned ON
that is)?

> - we're not skipping ticks on the boot cpu (yet), so we won't do the
>   do_timer and timer_recalc_offset calls.

Don't follow you here. I thought we wanted to let the boot cpu to skip
ticks too (as you seem to indicate down below - while talking of
updating xtime/NTP)? If we want to allow that, then we need to be able to
call do_timer with a regs argument?

> I think we could probably rearrange this code to eliminate the need
> for the regs argument - all it's used for is telling whether we were
> in user mode, and we know we weren't since we were in the idle task.
> That would mean that we maybe could call start_hz_timer from the idle
> task body instead of inside do_IRQ etc.  The only thing we'd have to
> watch out for is updating the decrementer if some interrupt handler
> called add_timer/mod_timer etc.

One problem in deferring the call to start_hz_timer like this is RCU. Ideally
we want to clear the tickless idle-CPU from nohz_cpu_mask -as soon- as it
starts processing an interrupt. Otherwise RCU will think that the CPU is in
tickless state (hence not accessing RCU-protected data structures) while the 
idle-CPU is processing interrupts/softirqs and will not include the CPU in its
grace-period processing. This may be a bad thing to allow. 

The other problem in deferring a call to start_hz_timer is we could cause them 
to be running with an incorrect jiffy value (can happen if all cpus were
skipping ticks for a period).

IMO we should let start_hz_timer be called as it is being called now, i.e
immediately when a tickless idle-CPU wakes up. This would also
unfortunately mean that we need to embed calls to start_hz_timer from all
the interrupt paths (do_IRQ, performance_monitor_exception etc) where we
come out of tickless state.

> Assuming we make the changes we have discussed to reduce the skewing
> of the decrementer interrupts quite small, and allow any cpu to call
> do_timer, then how where you thinking that xtime and the NTP stuff
> would get updated, in the situation where all cpus are skipping ticks?
> By doing N calls to do_timer in a row?  Or by adding N-1 to jiffies_64
> and then calling do_timer once?

I assume this implies we want to let boot-cpu to skip ticks?

Ideally it would be good if we add N-1 to jiffies_64 and call do_timer
once. But I wonder if it can be done w/o being racy. In order to calculate
N, we may calculate the difference, delta, between current timebase and
tb_last_stamp as:

	delta = tb_ticks_since(tb_last_stamp);	-> Step 1

We may then calculate N as:

	N = delta / tb_ticks_per_jiffy;		-> Step 2

However this will be racy, if we happen to sample timebase (Step 1) just
before a jiffy boundary. In which case, we would have missed to update
jiffy by 1? To avoid this, we may need to check for the corner case and
introduce a 'goto retry' loop?

BTW, I realized that my patch to let boot cpu to skip ticks also is
slightly buggy here:

                if (cpu != do_timer_cpu)
                       continue;

                write_seqlock(&xtime_lock);
		...
                do_timer(regs);
		...
                write_sequnlock(&xtime_lock);

There needs to a different while loop to call do_timer based on 
tb_ticks_since(tb_last_stamp). Will fix it next time around.

> 
> > +#ifdef CONFIG_NO_IDLE_HZ
> > +	max_skip = __USE_RTC() ? HZ : MAX_DEC_COUNT / tb_ticks_per_jiffy;
> > +#endif
> 
> If we allow the boot cpu to skip ticks, then we will have to make sure
> that at least one cpu wakes up in time to do the updating in
> recalc_timer_offset.

Good point. However is this critical, especially if all cpus had been
idle for quite a while? As I understand, if we don't wakeup in time to
do timer_recalc_offset, the only drawback is the first gettimeofday will
be slow (because it has make a system call)?



-- 
Regards,
vatsa

  reply	other threads:[~2006-04-24 15:41 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-04-07  6:30 [PATCH 1/4] tickless idle cpu - Allow any CPU to update jiffies Srivatsa Vaddagiri
2006-04-07 23:04 ` Paul Mackerras
2006-04-10 11:49   ` Srivatsa Vaddagiri
2006-04-10 12:18   ` [PATCH 1/2] tickless idle cpus: core patch - v2 Srivatsa Vaddagiri
2006-04-11 17:35     ` Paul Mackerras
2006-04-12  4:50       ` Srivatsa Vaddagiri
2006-04-21 10:49     ` Paul Mackerras
2006-04-24 15:39       ` Srivatsa Vaddagiri [this message]
2006-04-10 12:19   ` [PATCH 2/2] tickless idle cpus: allow boot cpu to skip ticks Srivatsa Vaddagiri

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060424153953.GA4934@in.ibm.com \
    --to=vatsa@in.ibm.com \
    --cc=linuxppc-dev@ozlabs.org \
    --cc=paulus@samba.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.