public inbox for linux-kernel@vger.kernel.org
* NO_HZ migration of TCP ack timers
@ 2010-02-18  5:28 Anton Blanchard
  2010-02-18  8:08 ` Andi Kleen
  2010-02-26 12:26 ` David Miller
  0 siblings, 2 replies; 7+ messages in thread
From: Anton Blanchard @ 2010-02-18  5:28 UTC (permalink / raw)
  To: arun, tglx; +Cc: davem, linux-kernel


Hi,

We have a networking workload on a large ppc64 box that is spending a lot
of its time in mod_timer(). One backtrace looks like:

83.25%  [k] ._spin_lock_irqsave
            |          
            |--99.62%-- .lock_timer_base
            |          .mod_timer
            |          .sk_reset_timer
            |          |          
            |          |--84.77%-- .tcp_send_delayed_ack
            |          |          .__tcp_ack_snd_check
            |          |          .tcp_rcv_established
            |          |          .tcp_v4_do_rcv

            |          |--12.72%-- .tcp_ack
            |          |          .tcp_rcv_established
            |          |          .tcp_v4_do_rcv

So it's mod_timer being called from the TCP ack timer code. It looks like
commit eea08f32adb3f97553d49a4f79a119833036000a (timers: Logic to move non
pinned timers) is causing it, in particular:

#if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
        if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu)) {
                int preferred_cpu = get_nohz_load_balancer();

                if (preferred_cpu >= 0)
                        cpu = preferred_cpu;
        }
#endif

and:

echo 0 > /proc/sys/kernel/timer_migration

makes the problem go away.

I think the problem is that the CPU is most likely to be idle when an rx
networking interrupt comes in. It seems the wrong thing to do to migrate the
ack timers off the CPU taking the interrupt, and with enough network
interfaces we end up in a train wreck, transferring everyone's ack timers to
the nohz load balancer CPU.

What should we do? Should we use mod_timer_pinned here? Or is this an issue
other areas might see (eg the block layer), in which case we should instead
avoid migrating timers created from interrupts?

Anton


* Re: NO_HZ migration of TCP ack timers
  2010-02-18  5:28 NO_HZ migration of TCP ack timers Anton Blanchard
@ 2010-02-18  8:08 ` Andi Kleen
  2010-02-18  9:55   ` Anton Blanchard
  2010-02-18 10:33   ` Arun R Bharadwaj
  2010-02-26 12:26 ` David Miller
  1 sibling, 2 replies; 7+ messages in thread
From: Andi Kleen @ 2010-02-18  8:08 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: arun, tglx, davem, linux-kernel, arjan, venkatesh.pallipadi

Anton Blanchard <anton@samba.org> writes:

> echo 0 > /proc/sys/kernel/timer_migration
>
> makes the problem go away.
>
> I think the problem is that the CPU is most likely to be idle when an rx
> networking interrupt comes in. It seems the wrong thing to do to migrate the
> ack timers off the CPU taking the interrupt, and with enough network
> interfaces we end up in a train wreck, transferring everyone's ack timers to
> the nohz load balancer CPU.

If the nohz balancer CPU is otherwise idle, shouldn't it have enough
cycles to handle acks for everyone? Is the problem the cache line
transfer time?

But yes, if it's not idle, the migration might need to spread out
to more CPUs.

>
> What should we do? Should we use mod_timer_pinned here? Or is this an issue

Sounds like something that should be controlled by the cpufreq governor's
idle predictor? Only migrate if the predicted idle time is long enough.
It's essentially the same problem as deciding how deeply idle to put
a CPU. Heavy measures only pay off if the expected time is long enough.

> other areas might see (eg the block layer), in which case we should instead
> avoid migrating timers created from interrupts?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: NO_HZ migration of TCP ack timers
  2010-02-18  8:08 ` Andi Kleen
@ 2010-02-18  9:55   ` Anton Blanchard
  2010-02-18 10:08     ` Andi Kleen
  2010-02-18 10:33   ` Arun R Bharadwaj
  1 sibling, 1 reply; 7+ messages in thread
From: Anton Blanchard @ 2010-02-18  9:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: arun, tglx, davem, linux-kernel, arjan, venkatesh.pallipadi


Hi Andi,

> If the nohz balancer CPU is otherwise idle, shouldn't it have enough
> cycles to handle acks for everyone? Is the problem the cache line
> transfer time?

Yeah, I think the timer spinlock on the nohz balancer cpu ends up being a
global lock for every other cpu trying to migrate their ack timers to it.

> Sounds like something that should be controlled by the cpufreq governor's
> idle predictor? Only migrate if the predicted idle time is long enough.
> It's essentially the same problem as deciding how deeply idle to put
> a CPU. Heavy measures only pay off if the expected time is long enough.

Interesting idea. It seems we need a better understanding of how idle a
CPU is, not just that it happens to be idle when mod_timer is called.

Anton


* Re: NO_HZ migration of TCP ack timers
  2010-02-18  9:55   ` Anton Blanchard
@ 2010-02-18 10:08     ` Andi Kleen
  0 siblings, 0 replies; 7+ messages in thread
From: Andi Kleen @ 2010-02-18 10:08 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Andi Kleen, arun, tglx, davem, linux-kernel, arjan,
	venkatesh.pallipadi

On Thu, Feb 18, 2010 at 08:55:30PM +1100, Anton Blanchard wrote:
> 
> Hi Andi,
> 
> > If the nohz balancer CPU is otherwise idle, shouldn't it have enough
> > cycles to handle acks for everyone? Is the problem the cache line
> > transfer time?
> 
> Yeah, I think the timer spinlock on the nohz balancer cpu ends up being a
> global lock for every other cpu trying to migrate their ack timers to it.

And they do that often, for short idle periods?

For longer idle periods that shouldn't be too bad.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: NO_HZ migration of TCP ack timers
  2010-02-18  8:08 ` Andi Kleen
  2010-02-18  9:55   ` Anton Blanchard
@ 2010-02-18 10:33   ` Arun R Bharadwaj
  2010-02-18 16:03     ` Andi Kleen
  1 sibling, 1 reply; 7+ messages in thread
From: Arun R Bharadwaj @ 2010-02-18 10:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Anton Blanchard, tglx, davem, linux-kernel, arjan,
	venkatesh.pallipadi, Arun Bharadwaj

* Andi Kleen <andi@firstfloor.org> [2010-02-18 09:08:35]:

> Anton Blanchard <anton@samba.org> writes:
> 
> > echo 0 > /proc/sys/kernel/timer_migration
> >
> > makes the problem go away.
> >
> > I think the problem is that the CPU is most likely to be idle when an rx
> > networking interrupt comes in. It seems the wrong thing to do to migrate the
> > ack timers off the CPU taking the interrupt, and with enough network
> > interfaces we end up in a train wreck, transferring everyone's ack timers to
> > the nohz load balancer CPU.
> 
> If the nohz balancer CPU is otherwise idle, shouldn't it have enough
> cycles to handle acks for everyone? Is the problem the cache line
> transfer time?
> 
> But yes, if it's not idle, the migration might need to spread out
> to more CPUs.
> 
> >
> > What should we do? Should we use mod_timer_pinned here? Or is this an issue
> 
> Sounds like something that should be controlled by the cpufreq governor's
> idle predictor? Only migrate if the predicted idle time is long enough.
> It's essentially the same problem as deciding how deeply idle to put
> a CPU. Heavy measures only pay off if the expected time is long enough.
> 

The cpuidle infrastructure has statistics about the idle times of
all the CPUs. Maybe we could use this infrastructure to decide
whether or not to migrate timers?

arun

> > other areas might see (eg the block layer), in which case we should instead
> > avoid migrating timers created from interrupts?
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only.


* Re: NO_HZ migration of TCP ack timers
  2010-02-18 10:33   ` Arun R Bharadwaj
@ 2010-02-18 16:03     ` Andi Kleen
  0 siblings, 0 replies; 7+ messages in thread
From: Andi Kleen @ 2010-02-18 16:03 UTC (permalink / raw)
  To: Arun R Bharadwaj
  Cc: Andi Kleen, Anton Blanchard, tglx, davem, linux-kernel, arjan,
	venkatesh.pallipadi

> > > What should we do? Should we use mod_timer_pinned here? Or is this an issue
> > 
> > Sounds like something that should be controlled by the cpufreq governor's
> > idle predictor? Only migrate if the predicted idle time is long enough.
> > It's essentially the same problem as deciding how deeply idle to put
> > a CPU. Heavy measures only pay off if the expected time is long enough.
> > 
> 
> The cpuidle infrastructure has statistics about the idle times of
> all the CPUs. Maybe we could use this infrastructure to decide
> whether or not to migrate timers?

Yes, sorry, I really meant cpuidle when I wrote cpufreq.
That's what I was suggesting too.

But if the problem is lock contention on the target CPU, that would
still not completely solve it, just make the contention less frequent
depending on the idle pattern.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: NO_HZ migration of TCP ack timers
  2010-02-18  5:28 NO_HZ migration of TCP ack timers Anton Blanchard
  2010-02-18  8:08 ` Andi Kleen
@ 2010-02-26 12:26 ` David Miller
  1 sibling, 0 replies; 7+ messages in thread
From: David Miller @ 2010-02-26 12:26 UTC (permalink / raw)
  To: anton; +Cc: arun, tglx, linux-kernel

From: Anton Blanchard <anton@samba.org>
Date: Thu, 18 Feb 2010 16:28:20 +1100

> I think the problem is that the CPU is most likely to be idle when an rx
> networking interrupt comes in. It seems the wrong thing to do to migrate the
> ack timers off the CPU taking the interrupt, and with enough network
> interfaces we end up in a train wreck, transferring everyone's ack timers to
> the nohz load balancer CPU.

This migration is against the very design of all of the TCP timers
currently in the tree.

For TCP, even when the timer is no longer needed, we don't cancel the
timer.  We do this in order to avoid touching the timer for the cancel
from a cpu other than the one the timer was scheduled on.

The timer therefore is always accessed, cache hot, locally to the cpu
where it was scheduled.

