linux-rt-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Haris Okanovic <haris.okanovic@ni.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	harisokn@gmail.com, bigeasy@linutronix.de,
	julia.cartwright@ni.com, gratian.crisan@ni.com,
	anna-maria@linutronix.de
Subject: Re: [PATCH v2] timers: Don't wake ktimersoftd on every tick
Date: Thu, 3 Aug 2017 16:04:18 -0500	[thread overview]
Message-ID: <1f71fbdb-810a-06b8-c652-d8b560f64e2a@ni.com> (raw)
In-Reply-To: <alpine.DEB.2.20.1707182206110.2425@nanos>

Thomas,

Apologies on the late response. I've been busy last few weeks.

On 07/18/2017 04:33 PM, Thomas Gleixner wrote:
> On Mon, 17 Jul 2017, Haris Okanovic wrote:
>> We recently upgraded from 4.1 to 4.6 and noticed a minor latency
>> regression caused by an additional thread wakeup (ktimersoftd) in
>> interrupt context on every tick. The wakeups are from
>> run_local_timers() raising TIMER_SOFTIRQ. Both TIMER and SCHED softirq
>> coalesced into one ksoftirqd wakeup prior to Sebastian's change to split
>> timers into their own thread.
>>
>> There's already logic in run_local_timers() to avoid some unnecessary
>> wakeups of ksoftirqd, but it doesn't seems to catch them all. In
>> particular, I've seen many unnecessary wakeups when jiffies increments
>> prior to run_local_timers().
>>
>> Change the way timers are collected per Julia and Thomas'
>> recommendation: Expired timers are now collected in interrupt context
>> and fired in ktimersoftd to avoid double-walk of `pending_map`.
>>
>> Collect expired timers in interrupt context to avoid overhead of waking
>> ktimersoftd on every tick. ktimersoftd now wakes only when one or more
>> timers are ready, which yields a minor reduction in small latency spikes.
>>
>> This is implemented by storing lists of expired timers in timer_base,
>> updated on each tick. Any addition to the lists wakes ktimersoftd
>> (softirq) to process those timers.
> 
> One thing which would be really good to have in the changelog is the
> overhead of that collection operation in hard irq context.
> 

Added testing note: Execution time of run_local_timers() increases by 
0.2us to 2.5us as measured by TSC on a 2core Intel Atom E3825 system.

I'm guessing the variance is spin lock contention caused by timers being 
added/removed by different threads.

I also verified the average case latency decrease in cyclictest and 
reran Anna-Maria's test on a qemu 4core Nehalem VM; latency decreases 
and no stalls.

>> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
>> index 5730d42bfd67..e5b537f2308c 100644
>> --- a/kernel/time/timer.c
>> +++ b/kernel/time/timer.c
>> @@ -209,9 +209,12 @@ struct timer_base {
>>   	bool			is_idle;
>>   	DECLARE_BITMAP(pending_map, WHEEL_SIZE);
>>   	struct hlist_head	vectors[WHEEL_SIZE];
>> +	struct hlist_head	expired_lists[LVL_DEPTH];
>> +	int			expired_count;
> 
> You need to look at the cache layout of that whole thing. My gut feeling
> tells me that that count is at the wrong place.
> 

You're right, there's a 4-byte hole after `lock` we can use. I'll move
`expired_count` there.

>>   } ____cacheline_aligned;
>>   
>>   static DEFINE_PER_CPU(struct timer_base, timer_bases[NR_BASES]);
>> +static DEFINE_PER_CPU(int, block_softirqs);
> 
> Why are you putting that into a seperate per cpu variable instead of adding
> a bool to the base struct as I suggested in my example:
> 
>         base->softirq_activated = false;
> 
> Having that separate makes no sense conceptually and cache wise it can
> force to touch yet another cacheline depending on the placement by
> compiler/linker. Looking at your implementation it does in 100% of the
> cases.
> 
> You can use the first base for that, as that is going to be touched anyway
> and is cache hot in any case.
> 

I was trying to avoid using twice as much memory in the NOHZ case and
didn't consider cache implications. There's actually another 1-byte hole
after `timer_base.is_idle` which can fit this bool.

>> -static void expire_timers(struct timer_base *base, struct hlist_head *head)
>> +static inline void __expire_timers(struct timer_base *base,
> 
> What's the purpose of this change? If it makes sense to inline it, then the
> compiler will do so.
> 

I inlined it because it only has one call site, but I'm sure the
compiler will figure that out as well. Dropped.

>> +				   struct hlist_head *head)
>>   {
>>   	while (!hlist_empty(head)) {
>>   		struct timer_list *timer;
>> @@ -1344,21 +1348,45 @@ static void expire_timers(struct timer_base *base, struct hlist_head *head)
>>   	}
>>   }
>>   
>> -static int __collect_expired_timers(struct timer_base *base,
>> -				    struct hlist_head *heads)
>> +static void expire_timers(struct timer_base *base)
>>   {
>> -	unsigned long clk = base->clk;
>> +	struct hlist_head *head;
>> +	int count = READ_ONCE(base->expired_count);
> 
> Please keep the reverse fir tree ordering based on length for the variables
> as we have it throughout that code.
> 

Moved clk.

>> +
>> +	while (count--) {
> 
> So this changed vs. the previous implementation and in this case the
> READ_ONCE() is pointless as the compiler CANNOT reevaluate base->foo inside
> that loop.
> 

Removed.

>> +		head = base->expired_lists + count;
>> +		__expire_timers(base, head);
>> +	}
>> +
>> +	/* Zero base->expired_count after processing all base->expired_lists
>> +	 * to signal it's ready to get re-populated. Otherwise, we race with
>> +	 * tick_find_expired() when base->lock is temporarily dropped in
>> +	 * __expire_timers() */
> 
> Please stick to the two sane comment styles:
> 
>         /* Single line comment */
> 
>         /*
>          * Multi line comment
>          * Multi line comment
>          * Multi line comment
> 	*/
> 
> For further enlightment: http://marc.info/?l=linux-kernel&m=146799838429328&w=2
> 

Fixed.

>> +	base->expired_count = 0;
>> +}
>> +
>> +static int __collect_expired_timers(struct timer_base *base)
>> +{
>> +	unsigned long clk;
>>   	struct hlist_head *vec;
>> -	int i, levels = 0;
>> +	int i;
>>   	unsigned int idx;
> 
> See above
> 
>> +	/*
>> +	 * expire_timers() must be called at least once before we can
>> +	 * collect more timers
>> +	 */
>> +	if (base->expired_count)
>> +		goto end;
>> +
>> +	clk = base->clk;
>>   	for (i = 0; i < LVL_DEPTH; i++) {
>>   		idx = (clk & LVL_MASK) + i * LVL_SIZE;
>>   
>>   		if (__test_and_clear_bit(idx, base->pending_map)) {
>>   			vec = base->vectors + idx;
>> -			hlist_move_list(vec, heads++);
>> -			levels++;
>> +			hlist_move_list(vec,
>> +				&base->expired_lists[base->expired_count++]);
> 
> Eew. What's wrong with local variables ?
> 
>       struct hist_head *list = &base->expired_vectors;
> 
> at the top of this function and then do
> 
> 			hlist_move_list(vec, list++);
> 			base->expired_levels++;
> 
> or have a local count and use it as index to list[]. The code generation
> should be roughly the same, but I expect it to be better with the seperate
> increments.
> 

Done, looks nicer. I was trying to keep changes to existing code minimal.

>>   		}
>>   		/* Is it time to look at the next level? */
>>   		if (clk & LVL_CLK_MASK)
>> @@ -1366,7 +1394,8 @@ static int __collect_expired_timers(struct timer_base *base,
>>   		/* Shift clock for the next level granularity */
>>   		clk >>= LVL_CLK_SHIFT;
>>   	}
>> -	return levels;
>> +
>> +	end: return base->expired_count;
> 
> More Eeew! Can you please look how labels are placed in the
> kernel. Certainly not that way.
> 
> Aside of that the goto is silly. You can just return expired_count up at
> that conditional, or move the conditional to the caller.
> 

Replaced goto with simple return.

> Actually I do not understand that conditional at the top at all. The call
> site breaks out of the loop when the return value is > 0. So what's that
> for? Paranoia? If that's the case then you want a WARN_ONCE there, because
> that should never happen. Otherwise it's just pointless. If actually
> something relies on that, then it's disgusting.
> 
Paranoia. We should never hit this case unless TIMER_SOFTIRQ got raised
without expired timers. Added WARN_ONCE().

>>   }
>>   
>>   #ifdef CONFIG_NO_HZ_COMMON
>> @@ -1559,8 +1588,7 @@ void timer_clear_idle(void)
>>   	base->is_idle = false;
>>   }
>>   
>> -static int collect_expired_timers(struct timer_base *base,
>> -				  struct hlist_head *heads)
>> +static int collect_expired_timers(struct timer_base *base)
>>   {
>>   	/*
>>   	 * NOHZ optimization. After a long idle sleep we need to forward the
>> @@ -1581,16 +1609,41 @@ static int collect_expired_timers(struct timer_base *base,
>>   		}
>>   		base->clk = next;
>>   	}
>> -	return __collect_expired_timers(base, heads);
>> +	return __collect_expired_timers(base);
>>   }
>>   #else
>> -static inline int collect_expired_timers(struct timer_base *base,
>> -					 struct hlist_head *heads)
>> +static inline int collect_expired_timers(struct timer_base *base)
>>   {
>> -	return __collect_expired_timers(base, heads);
>> +	return __collect_expired_timers(base);
>>   }
>>   #endif
>>   
>> +/* Increments timer_base to current jiffies or until first expired
>> + * timer is found. Return number of expired timers. */
> 
> Sigh.
> 

Fixed comment formatting.

>> +static int find_expired_timers(struct timer_base *base)
>> +{
>> +	const unsigned long int end_clk = jiffies;
> 
> const ? unsigned long int ?
> 

Dropped the const. Didn't realize it violated a coding convention.

>> +	int expired_count;
>> +
>> +	while ( !(expired_count = collect_expired_timers(base)) &&
>> +			time_after_eq(end_clk, base->clk) ) {
> 
> These extra white spaces after ( and before ) are pointless and not kernel
> coding style.
> 
> What's worse is the order of your conditionals. Just look at the original
> code.
> 
>> +		base->clk++;
>> +	}
> 

Fixed.

> Aside of that this loop is fricking hard to read.
> 
> 	int levels = 0;
> 
> 	while (!levels && time_after_eq(jiffies, base->clk)) {
> 	      levels = collect_expired_timers(base, heads);
> 	      base->clk++;
> 	}
> 	
> 	return levels;
> 
> Is all what you need here, right? That's what the original loop does as
> well.
> 

Correct, but the original loop was in __run_timers() and this one is 
called from both __run_timers() and run_local_timers(), which is why I 
moved it to a separate function.

>> +
>> +	return expired_count;
>> +}
>> +
>> +/* Called from CPU tick routine to collect expired timers up to current
>> + * jiffies. Return number of expired timers. */
> 
> Wrong. It returns the number of levels which have expired timers. The
> number of actual timers per level is unknown as we move the complete list.
> 

Fixed comment.

>> +static int tick_find_expired(struct timer_base *base)
>> +{
>> +	int count;
> 
> Missing new line between declaration and code. checkpatch.pl is wrong on a
> lot of things, but it would have told you.
> 

Fixed.

>> +	raw_spin_lock(&base->lock);
>> +	count = find_expired_timers(base);
>> +	raw_spin_unlock(&base->lock);
>> +	return count;
> 
> Please be consistent with the names. We use 'levels' throughout all the other
> functions. Random variable names are just confusing.
> 

Renamed "count" to "levels" in timer_base and various functions.

>> +}
>> +
>>   /*
>>    * Called from the timer interrupt handler to charge one tick to the current
>>    * process.  user_tick is 1 if the tick is user time, 0 for system.
>> @@ -1618,22 +1671,11 @@ void update_process_times(int user_tick)
>>    */
>>   static inline void __run_timers(struct timer_base *base)
>>   {
>> -	struct hlist_head heads[LVL_DEPTH];
>> -	int levels;
>> -
>> -	if (!time_after_eq(jiffies, base->clk))
>> -		return;
>> -
>>   	raw_spin_lock_irq(&base->lock);
>>   
>> -	while (time_after_eq(jiffies, base->clk)) {
>> +	while (find_expired_timers(base))
>> +		expire_timers(base);
> 
> Now I understand that extra conditional above. That's crap, really. Two
> ways to solve that:
> 
> 	do {
>         		expire_timers(base);
> 	} while (find_expired_timers(base));
> 
> 	which requires a check for base->expired_levels inside of
> 	expire_timers().
> 
> or
> 
>         if (base->expired_levels)
>         		expire_timers(base);
> 
> 	while (find_expired_timers(base))
>         		expire_timers(base);
> 

The do-while approach works for me. expire_timers() already noops when
expired_levels is zero. However, I would like to keep the
WARN_ONCE(expired_levels) check in __collect_expired_timers() as a
sanity check.

>>   	raw_spin_unlock_irq(&base->lock);
>>   	wakeup_timer_waiters(base);
> 
> Errm. Please submit patches against mainline. This is RT only. On mainline
> the overhead of raising the softirq is not that big, but the exercise is
> the same.
> 

I have been submitting to both mailing lists simultaneously.

>> @@ -1644,12 +1686,16 @@ static inline void __run_timers(struct timer_base *base)
>>   static __latent_entropy void run_timer_softirq(struct softirq_action *h)
>>   {
>>   	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
>> +	int* block_softirq = this_cpu_ptr(&block_softirqs);
> 
> Sigh. A pointer is declared with:
> 
> 	int *p;
> 
> and not
> 
> 	int* p;
> 
>>   	irq_work_tick_soft();
>>   
>>   	__run_timers(base);
>>   	if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active)
>>   		__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
>> +
>> +	/* Allow new TIMER_SOFTIRQs to get scheduled by run_local_timers() */
>> +	WRITE_ONCE(*block_softirq, 0);
> 
> You are in interrupt enabled code here. So you actually might miss a wakeup
> and delay it to the next tick. If that's your intention then please
> document it proper. If not, you need to disable interrupts around the write
> and recheck stuff.
> 

I'm not sure what you mean exaclty. My intention here is to only permit 
new TIMER_SOFTIRQs to get raised by run_local_timers(). See updated 
commit message for details.

> Also the WRITE_ONCE() is pointless. The compiler cannot reorder the
> write. And it does not protect you from racing with the hard interrupt. So
> for the sloppy variant a simple:
> 
>         base->softirq_activated = false;
> 
> is sufficient.
> 
>>   }
>>   
>>   /*
>> @@ -1657,18 +1703,28 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
>>    */
>>   void run_local_timers(void)
>>   {
>> -	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
>> +	int* block_softirq = this_cpu_ptr(&block_softirqs);
>> +	struct timer_base *base;
>>   
>>   	hrtimer_run_queues();
>> +
>> +	/* Skip if TIMER_SOFTIRQ is already running for this CPU */
>> +	if (READ_ONCE(*block_softirq))
>> +		return;
>> +
>> +	base = this_cpu_ptr(&timer_bases[BASE_STD]);
> 
> And this here becomes:
> 
>      	if (base->softirq_activated)
>      		return;
> 
>> +
>>   	/* Raise the softirq only if required. */
>> -	if (time_before(jiffies, base->clk)) {
>> +	if (time_before(jiffies, base->clk) || !tick_find_expired(base)) {
>>   		if (!IS_ENABLED(CONFIG_NO_HZ_COMMON) || !base->nohz_active)
>>   			return;
>>   		/* CPU is awake, so check the deferrable base. */
>>   		base++;
>> -		if (time_before(jiffies, base->clk))
>> +		if (time_before(jiffies, base->clk) || !tick_find_expired(base))
>>   			return;
> 
> To make that work, all you need here is:
> 
> 		base--;
> 
>>   	}
> 
> and
> 	base->softirq_activated = true;
> 

Done. Dropped WRITE_ONCE().

>>   static void __init init_timer_cpu(int cpu)
>>   {
>>   	struct timer_base *base;
>> +	int* block_softirq;
>>   	int i;
>>   
>>   	for (i = 0; i < NR_BASES; i++) {
>> @@ -1852,6 +1910,10 @@ static void __init init_timer_cpu(int cpu)
>>   #ifdef CONFIG_PREEMPT_RT_FULL
>>   		init_swait_queue_head(&base->wait_for_running_timer);
>>   #endif
>> +		base->expired_count = 0;
>> +
>> +		block_softirq = per_cpu_ptr(&block_softirqs, cpu);
>> +		*block_softirq = 0;
> 
> What kind of voodoo initialization is this? Do you not trust BSS? Or do you
> not make sure that the stuff is brought into proper state when a CPU goes
> offline?
> 

Yea, this is pointless. Not sure what I was thinking. Removed.

> Aside of the above, this patch wants to be split into two pieces:
> 
>    1) Embedd the hlist heads for expired bucket collection into base
>       struct and adjust the code accordingly.
> 
>    2) Implement the conditional softirq raise machinery
> 

I agree. I split it and will submit a PATCH v3 shortly.

> Thanks,
> 
> 	tglx
> 

Thanks,
Haris

  reply	other threads:[~2017-08-03 21:04 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-13 21:44 [RFC v2] timers: Don't wake ktimersoftd on every tick Haris Okanovic
2016-12-23 17:28 ` Sebastian Andrzej Siewior
2016-12-28 18:06   ` Haris Okanovic
2017-02-03 16:51 ` Sebastian Andrzej Siewior
2017-02-03 18:21   ` [PATCH] " Haris Okanovic
2017-02-10 17:02     ` Sebastian Andrzej Siewior
2017-05-26 17:16       ` [PATCH] Revert "timers: Don't wake ktimersoftd on every tick" Anna-Maria Gleixner
2017-05-26 19:27         ` Haris Okanovic
2017-05-26 19:49           ` Thomas Gleixner
2017-05-26 20:25             ` Haris Okanovic
2017-05-26 20:50               ` Thomas Gleixner
2017-06-02 14:37                 ` Haris Okanovic
2017-06-04 14:17                   ` Thomas Gleixner
2017-07-17 21:58                     ` Haris Okanovic
2017-07-17 22:04                     ` [PATCH v2] timers: Don't wake ktimersoftd on every tick Haris Okanovic
2017-07-18 21:33                       ` Thomas Gleixner
2017-08-03 21:04                         ` Haris Okanovic [this message]
2017-08-03 21:06                         ` [PATCH v3 1/2] " Haris Okanovic
2017-08-03 21:06                           ` [PATCH v3 2/2] timers: Don't search for expired timers while TIMER_SOFTIRQ is scheduled Haris Okanovic
2018-01-05 19:37                             ` Haris Okanovic
2018-03-01 15:49                               ` Haris Okanovic
2018-03-01 15:54                                 ` Thomas Gleixner
2018-03-01 16:35                                   ` Haris Okanovic
2018-03-01 16:47                                 ` Sebastian Andrzej Siewior
2018-03-01 18:37                                   ` Haris Okanovic
2018-03-01 19:06                                     ` Sebastian Andrzej Siewior
2018-03-02 14:52                                     ` Sebastian Andrzej Siewior
2018-03-02 16:29                                       ` Haris Okanovic
2018-03-02 16:39                                         ` Sebastian Andrzej Siewior
2018-03-02 17:19                                           ` Haris Okanovic
2018-03-06 23:39                         ` [PATCH v4 1/2] timers: Don't wake ktimersoftd on every tick Haris Okanovic
2018-03-06 23:39                           ` [PATCH v4 2/2] timers: Don't search for expired timers while TIMER_SOFTIRQ is scheduled Haris Okanovic
     [not found]                           ` <1523332961.4827.1.camel@gmx.de>
2018-04-12 15:00                             ` [PATCH v4 1/2] timers: Don't wake ktimersoftd on every tick Haris Okanovic
2018-06-19 12:43                               ` Daniel Bristot de Oliveira
2018-06-20 14:24                                 ` Haris Okanovic
2018-06-28 16:36                               ` Haris Okanovic
2018-06-28 16:40                                 ` [PATCH v5 " Haris Okanovic
2018-06-28 16:40                                   ` [PATCH v5 2/2] timers: Don't search for expired timers while TIMER_SOFTIRQ is scheduled Haris Okanovic
2018-07-13 12:01                                   ` [PATCH v5 1/2] timers: Don't wake ktimersoftd on every tick Anna-Maria Gleixner
2018-07-13 14:37                                     ` Haris Okanovic
2017-05-27  7:47         ` [PATCH] Revert "timers: Don't wake ktimersoftd on every tick" Sebastian Andrzej Siewior
2017-02-03 18:27   ` [RFC v2] timers: Don't wake ktimersoftd on every tick Haris Okanovic

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1f71fbdb-810a-06b8-c652-d8b560f64e2a@ni.com \
    --to=haris.okanovic@ni.com \
    --cc=anna-maria@linutronix.de \
    --cc=bigeasy@linutronix.de \
    --cc=gratian.crisan@ni.com \
    --cc=harisokn@gmail.com \
    --cc=julia.cartwright@ni.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rt-users@vger.kernel.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).