* [git pull] scheduler/misc fixes
@ 2008-04-24 22:55 Ingo Molnar
2008-04-25 3:46 ` David Miller
0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2008-04-24 22:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, Andrew Morton, Peter Zijlstra
Linus, please pull the latest scheduler/misc fixes git tree from:
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes.git for-linus
a scheduler fix, a (long-standing) seqlock fix and a softlockup+nohz
fix.
Thanks,
Ingo
------------------>
Ingo Molnar (2):
seqlock: livelock fix
softlockup: fix NOHZ wakeup
Peter Zijlstra (1):
sched: fix share (re)distribution
include/linux/seqlock.h | 46 ++++++++++++++++++++++++++++----------------
kernel/sched.c | 47 +--------------------------------------------
kernel/time/tick-sched.c | 1 +
3 files changed, 32 insertions(+), 62 deletions(-)
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 26e4925..632205c 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -85,23 +85,29 @@ static inline int write_tryseqlock(seqlock_t *sl)
/* Start of read calculation -- fetch last complete writer token */
static __always_inline unsigned read_seqbegin(const seqlock_t *sl)
{
- unsigned ret = sl->sequence;
+ unsigned ret;
+
+repeat:
+ ret = sl->sequence;
smp_rmb();
+ if (unlikely(ret & 1)) {
+ cpu_relax();
+ goto repeat;
+ }
+
return ret;
}
-/* Test if reader processed invalid data.
- * If initial values is odd,
- * then writer had already started when section was entered
- * If sequence value changed
- * then writer changed data while in section
- *
- * Using xor saves one conditional branch.
+/*
+ * Test if reader processed invalid data.
+ *
+ * If sequence value changed then writer changed data while in section.
*/
-static __always_inline int read_seqretry(const seqlock_t *sl, unsigned iv)
+static __always_inline int read_seqretry(const seqlock_t *sl, unsigned start)
{
smp_rmb();
- return (iv & 1) | (sl->sequence ^ iv);
+
+ return (sl->sequence != start);
}
@@ -122,20 +128,26 @@ typedef struct seqcount {
/* Start of read using pointer to a sequence counter only. */
static inline unsigned read_seqcount_begin(const seqcount_t *s)
{
- unsigned ret = s->sequence;
+ unsigned ret;
+
+repeat:
+ ret = s->sequence;
smp_rmb();
+ if (unlikely(ret & 1)) {
+ cpu_relax();
+ goto repeat;
+ }
return ret;
}
-/* Test if reader processed invalid data.
- * Equivalent to: iv is odd or sequence number has changed.
- * (iv & 1) || (*s != iv)
- * Using xor saves one conditional branch.
+/*
+ * Test if reader processed invalid data because sequence number has changed.
*/
-static inline int read_seqcount_retry(const seqcount_t *s, unsigned iv)
+static inline int read_seqcount_retry(const seqcount_t *s, unsigned start)
{
smp_rmb();
- return (iv & 1) | (s->sequence ^ iv);
+
+ return s->sequence != start;
}
diff --git a/kernel/sched.c b/kernel/sched.c
index 0014b03..85e1721 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1657,42 +1657,6 @@ void aggregate_group_weight(struct task_group *tg, struct sched_domain *sd)
}
/*
- * Redistribute tg->shares amongst all tg->cfs_rq[]s.
- */
-static void __aggregate_redistribute_shares(struct task_group *tg)
-{
- int i, max_cpu = smp_processor_id();
- unsigned long rq_weight = 0;
- unsigned long shares, max_shares = 0, shares_rem = tg->shares;
-
- for_each_possible_cpu(i)
- rq_weight += tg->cfs_rq[i]->load.weight;
-
- for_each_possible_cpu(i) {
- /*
- * divide shares proportional to the rq_weights.
- */
- shares = tg->shares * tg->cfs_rq[i]->load.weight;
- shares /= rq_weight + 1;
-
- tg->cfs_rq[i]->shares = shares;
-
- if (shares > max_shares) {
- max_shares = shares;
- max_cpu = i;
- }
- shares_rem -= shares;
- }
-
- /*
- * Ensure it all adds up to tg->shares; we can loose a few
- * due to rounding down when computing the per-cpu shares.
- */
- if (shares_rem)
- tg->cfs_rq[max_cpu]->shares += shares_rem;
-}
-
-/*
* Compute the weight of this group on the given cpus.
*/
static
@@ -1701,18 +1665,11 @@ void aggregate_group_shares(struct task_group *tg, struct sched_domain *sd)
unsigned long shares = 0;
int i;
-again:
for_each_cpu_mask(i, sd->span)
shares += tg->cfs_rq[i]->shares;
- /*
- * When the span doesn't have any shares assigned, but does have
- * tasks to run do a machine wide rebalance (should be rare).
- */
- if (unlikely(!shares && aggregate(tg, sd)->rq_weight)) {
- __aggregate_redistribute_shares(tg);
- goto again;
- }
+ if ((!shares && aggregate(tg, sd)->rq_weight) || shares > tg->shares)
+ shares = tg->shares;
aggregate(tg, sd)->shares = shares;
}
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d358d4e..b854a89 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -393,6 +393,7 @@ void tick_nohz_restart_sched_tick(void)
sub_preempt_count(HARDIRQ_OFFSET);
}
+ touch_softlockup_watchdog();
/*
* Cancel the scheduled timer and restore the tick
*/
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [git pull] scheduler/misc fixes
2008-04-24 22:55 [git pull] scheduler/misc fixes Ingo Molnar
@ 2008-04-25 3:46 ` David Miller
2008-04-25 7:48 ` Peter Zijlstra
2008-04-25 8:04 ` Ingo Molnar
0 siblings, 2 replies; 13+ messages in thread
From: David Miller @ 2008-04-25 3:46 UTC (permalink / raw)
To: mingo; +Cc: torvalds, linux-kernel, akpm, a.p.zijlstra, viro, alan
From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 25 Apr 2008 00:55:30 +0200
>
> Linus, please pull the latest scheduler/misc fixes git tree from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes.git for-linus
>
> a scheduler fix, a (long-standing) seqlock fix and a softlockup+nohz
> fix.
Correction, the softlockup+nohz patch here doesn't actually fix the
reported regression. It fixes some other theoretical bug you
discovered while trying to fix the regression reports.
Ingo, you've known exactly which changeset adds the regression for
more than 4 days now.
It's a small changeset, and you have no idea what the problem is. The
changeset in question is labelled as a bug fix, but it's pointless for
it to be there if it adds problems too.
Please revert this change until you understand the problem better.
Please! I even gave Peter Z. access to my Niagara2 system, so you can
ask him to help you figure out the reason for all of these softlockup
regressions added by the sched merge 4 days ago.
Please also add the trivial regression fix I posted to you earlier,
included again below; you seem to be selectively reading email from me
and missing bug fix patches in the process.
sched: Use alloc_bootmem() instead of alloc_bootmem_low()
There is no guarantee that there is physical RAM below 4GB, and in
fact many boxes have none there at all.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/kernel/sched.c b/kernel/sched.c
index 0014b03..09ca69b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8128,7 +8128,7 @@ void __init sched_init(void)
* we use alloc_bootmem().
*/
if (alloc_size) {
- ptr = (unsigned long)alloc_bootmem_low(alloc_size);
+ ptr = (unsigned long)alloc_bootmem(alloc_size);
#ifdef CONFIG_FAIR_GROUP_SCHED
init_task_group.se = (struct sched_entity **)ptr;
* Re: [git pull] scheduler/misc fixes
2008-04-25 3:46 ` David Miller
@ 2008-04-25 7:48 ` Peter Zijlstra
2008-04-25 7:57 ` David Miller
2008-04-25 10:19 ` Peter Zijlstra
2008-04-25 8:04 ` Ingo Molnar
1 sibling, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2008-04-25 7:48 UTC (permalink / raw)
To: David Miller
Cc: mingo, torvalds, linux-kernel, akpm, viro, alan, Thomas Gleixner
On Thu, 2008-04-24 at 20:46 -0700, David Miller wrote:
> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri, 25 Apr 2008 00:55:30 +0200
>
> >
> > Linus, please pull the latest scheduler/misc fixes git tree from:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes.git for-linus
> >
> > a scheduler fix, a (long-standing) seqlock fix and a softlockup+nohz
> > fix.
>
> Correction, the softlockup+nohz patch here doesn't actually fix the
> reported regression. It fixes some other theoretical bug you
> discovered while trying to fix the regression reports.
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index d358d4e..b854a89 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -393,6 +393,7 @@ void tick_nohz_restart_sched_tick(void)
> sub_preempt_count(HARDIRQ_OFFSET);
> }
>
> + touch_softlockup_watchdog();
> /*
> * Cancel the scheduled timer and restore the tick
> */
That actually does solve the problem (I tested, as have you:
http://lkml.org/lkml/2008/4/22/98).
However, in that email referenced above you rightly point out that it's
likely working around a bug rather than fixing it. I agree.
So it used to work; and these two commits:
commit 15934a37324f32e0fda633dc7984a671ea81cd75
Author: Guillaume Chazarain <guichaz@yahoo.fr>
Date: Sat Apr 19 19:44:57 2008 +0200
sched: fix rq->clock overflows detection with CONFIG_NO_HZ
When using CONFIG_NO_HZ, rq->tick_timestamp is not updated every TICK_NSEC.
We check that the number of skipped ticks matches the clock jump seen in
__update_rq_clock().
Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
commit 27ec4407790d075c325e1f4da0a19c56953cce23
Author: Ingo Molnar <mingo@elte.hu>
Date: Thu Feb 28 21:00:21 2008 +0100
sched: make cpu_clock() globally synchronous
Alexey Zaytsev reported (and bisected) that the introduction of
cpu_clock() in printk made the timestamps jump back and forth.
Make cpu_clock() more reliable while still keeping it fast when it's
called frequently.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
break the thing - that is, they generate false softlockup msgs.
Reverting them does indeed make them go away.
You've said (http://lkml.org/lkml/2008/4/22/98, and in testing on your
machine I have indeed found it so) that sparc64's sched_clock() is
rock solid.
So what I've done is (-linus without the above two commits):
unsigned long long cpu_clock(int cpu)
{
	unsigned long long now;
	unsigned long flags;
	struct rq *rq;

	/*
	 * Only call sched_clock() if the scheduler has already been
	 * initialized (some code might call cpu_clock() very early):
	 */
	if (unlikely(!scheduler_running))
		return 0;

#if 0
	local_irq_save(flags);
	rq = cpu_rq(cpu);
	update_rq_clock(rq);
	now = rq->clock;
	local_irq_restore(flags);
#else
	now = sched_clock();
#endif

	return now;
}
This cuts out all the rq->clock logic and should give a stable time on
your machine.
Lo and Behold, the softlockups are back!
Now things get a little hazy:
a) 15934a37324f32e0fda633dc7984a671ea81cd75 does indeed fix a bug in
rq->clock; without that patch it compresses nohz time to a single
jiffy, so cpu_clock(), which (without the above hack) is based on
rq->clock, will be short on nohz time. This can 'hide' the clock jump
and thus hide false positives.
b) there is commit:
---
commit d3938204468dccae16be0099a2abf53db4ed0505
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Wed Nov 28 15:52:56 2007 +0100
softlockup: fix false positives on CONFIG_NOHZ
David Miller reported soft lockup false-positives that trigger
on NOHZ due to CPUs idling for more than 10 seconds.
The solution is to touch the softlockup watchdog when we return from
idle. (by definition we are not 'locked up' when we were idle)
http://bugzilla.kernel.org/show_bug.cgi?id=9409
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 27a2338..cb89fa8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -133,6 +133,8 @@ void tick_nohz_update_jiffies(void)
if (!ts->tick_stopped)
return;
+ touch_softlockup_watchdog();
+
cpu_clear(cpu, nohz_cpu_mask);
now = ktime_get();
---
which should 'fix' this problem.
c) there are 'IPI' handlers on SPARC64 that look like they can wake
the CPU from idle sleep but do not appear to call irq_enter(), which
has the above patch's touch_softlockup_watchdog() in its callchain.
tl0_irq1: TRAP_IRQ(smp_call_function_client, 1)
tl0_irq2: TRAP_IRQ(smp_receive_signal_client, 2)
tl0_irq3: TRAP_IRQ(smp_penguin_jailcell, 3)
tl0_irq4: TRAP_IRQ(smp_new_mmu_context_version_client, 4)
So the current working thesis is that the bug in a) hides a real problem
not quite fixed by b) and exploited by c).
When I tried to build a state machine to validate this thesis the kernel
blew up on me, so I guess I need to go get my morning juice and try
again ;-)
Insight into the above is appreciated as I'm out on a limb on two
fronts: nohz and sparc64 :-)
/me goes onward testing..
* Re: [git pull] scheduler/misc fixes
2008-04-25 7:48 ` Peter Zijlstra
@ 2008-04-25 7:57 ` David Miller
2008-04-25 8:13 ` Peter Zijlstra
2008-04-25 8:24 ` Peter Zijlstra
2008-04-25 10:19 ` Peter Zijlstra
1 sibling, 2 replies; 13+ messages in thread
From: David Miller @ 2008-04-25 7:57 UTC (permalink / raw)
To: a.p.zijlstra; +Cc: mingo, torvalds, linux-kernel, akpm, viro, alan, tglx
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri, 25 Apr 2008 09:48:30 +0200
> c) there are 'IPI' handlers on SPARC64 that look like they can wake
> the CPU from idle sleep but do not appear to call irq_enter() which
> has the above patch's touch_softlockup_watchdog() in its callchain.
>
> tl0_irq1: TRAP_IRQ(smp_call_function_client, 1)
> tl0_irq2: TRAP_IRQ(smp_receive_signal_client, 2)
> tl0_irq3: TRAP_IRQ(smp_penguin_jailcell, 3)
> tl0_irq4: TRAP_IRQ(smp_new_mmu_context_version_client, 4)
>
>
>
> So the current working thesis is that the bug in a) hides a real problem
> not quite fixed by b) and exploited by c).
The equivalent to smp_receive_signal_client() on x86
(smp_reschedule_interrupt) doesn't do an irq_enter() either.
However x86 does do an irq_enter() for smp_call_function() interrupt
handling.
What is the rule in these cases?
Anyways, does the following patch fix the problem?
diff --git a/arch/sparc64/kernel/smp.c b/arch/sparc64/kernel/smp.c
index 524b889..bf4ef84 100644
--- a/arch/sparc64/kernel/smp.c
+++ b/arch/sparc64/kernel/smp.c
@@ -866,14 +866,21 @@ void smp_call_function_client(int irq, struct pt_regs *regs)
void *info = call_data->info;
clear_softint(1 << irq);
+
+ irq_enter();
+
+ if (!call_data->wait) {
+ /* let initiator proceed after getting data */
+ atomic_inc(&call_data->finished);
+ }
+
+ func(info);
+
+ irq_exit();
+
if (call_data->wait) {
/* let initiator proceed only after completion */
- func(info);
- atomic_inc(&call_data->finished);
- } else {
- /* let initiator proceed after getting data */
atomic_inc(&call_data->finished);
- func(info);
}
}
* Re: [git pull] scheduler/misc fixes
2008-04-25 3:46 ` David Miller
2008-04-25 7:48 ` Peter Zijlstra
@ 2008-04-25 8:04 ` Ingo Molnar
2008-04-25 8:07 ` David Miller
1 sibling, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2008-04-25 8:04 UTC (permalink / raw)
To: David Miller; +Cc: torvalds, linux-kernel, akpm, a.p.zijlstra, viro, alan
* David Miller <davem@davemloft.net> wrote:
> Please also add the trivial regression fix I posted to you earlier,
> included again below; you seem to be selectively reading email from me
> and missing bug fix patches in the process.
>
> sched: Use alloc_bootmem() instead of alloc_bootmem_low()
thanks David, i queued your fix up.
should we perhaps make alloc_bootmem_low() generate a warning instead of
breaking the bootup? Especially on systems that have no physical RAM
below 4GB at all, the whole notion of 'lowmem' is meaningless.
Ingo
* Re: [git pull] scheduler/misc fixes
2008-04-25 8:04 ` Ingo Molnar
@ 2008-04-25 8:07 ` David Miller
0 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2008-04-25 8:07 UTC (permalink / raw)
To: mingo; +Cc: torvalds, linux-kernel, akpm, a.p.zijlstra, viro, alan
From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 25 Apr 2008 10:04:21 +0200
> should we perhaps make alloc_bootmem_low() generate a warning instead of
> breaking the bootup? Especially on systems that have no physical RAM
> below 4GB at all, the whole notion of 'lowmem' is meaningless.
It spits out an error to the console.
* Re: [git pull] scheduler/misc fixes
2008-04-25 7:57 ` David Miller
@ 2008-04-25 8:13 ` Peter Zijlstra
2008-04-25 8:24 ` Peter Zijlstra
1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2008-04-25 8:13 UTC (permalink / raw)
To: David Miller; +Cc: mingo, torvalds, linux-kernel, akpm, viro, alan, tglx
On Fri, 2008-04-25 at 00:57 -0700, David Miller wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Fri, 25 Apr 2008 09:48:30 +0200
>
> > c) there are 'IPI' handlers on SPARC64 that look like they can wake
> > the CPU from idle sleep but do not appear to call irq_enter() which
> > has the above patch's touch_softlockup_watchdog() in its callchain.
> >
> > tl0_irq1: TRAP_IRQ(smp_call_function_client, 1)
> > tl0_irq2: TRAP_IRQ(smp_receive_signal_client, 2)
> > tl0_irq3: TRAP_IRQ(smp_penguin_jailcell, 3)
> > tl0_irq4: TRAP_IRQ(smp_new_mmu_context_version_client, 4)
> >
> >
> >
> > So the current working thesis is that the bug in a) hides a real problem
> > not quite fixed by b) and exploited by c).
>
> The equivalent to smp_receive_signal_client() on x86
> (smp_reschedule_interrupt) doesn't do an irq_enter() either.
>
> However x86 does do an irq_enter() for smp_call_function() interrupt
> handling.
>
> What is the rule in these cases?
>
> Anyways, does the following patch fix the problem?
Sadly, no :-(
> diff --git a/arch/sparc64/kernel/smp.c b/arch/sparc64/kernel/smp.c
> index 524b889..bf4ef84 100644
> --- a/arch/sparc64/kernel/smp.c
> +++ b/arch/sparc64/kernel/smp.c
> @@ -866,14 +866,21 @@ void smp_call_function_client(int irq, struct pt_regs *regs)
> void *info = call_data->info;
>
> clear_softint(1 << irq);
> +
> + irq_enter();
> +
> + if (!call_data->wait) {
> + /* let initiator proceed after getting data */
> + atomic_inc(&call_data->finished);
> + }
> +
> + func(info);
> +
> + irq_exit();
> +
> if (call_data->wait) {
> /* let initiator proceed only after completion */
> - func(info);
> - atomic_inc(&call_data->finished);
> - } else {
> - /* let initiator proceed after getting data */
> atomic_inc(&call_data->finished);
> - func(info);
> }
> }
* Re: [git pull] scheduler/misc fixes
2008-04-25 7:57 ` David Miller
2008-04-25 8:13 ` Peter Zijlstra
@ 2008-04-25 8:24 ` Peter Zijlstra
2008-04-25 8:30 ` David Miller
1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2008-04-25 8:24 UTC (permalink / raw)
To: David Miller; +Cc: mingo, torvalds, linux-kernel, akpm, viro, alan, tglx
On Fri, 2008-04-25 at 00:57 -0700, David Miller wrote:
> diff --git a/arch/sparc64/kernel/smp.c b/arch/sparc64/kernel/smp.c
> index 524b889..bf4ef84 100644
> --- a/arch/sparc64/kernel/smp.c
> +++ b/arch/sparc64/kernel/smp.c
> @@ -866,14 +866,21 @@ void smp_call_function_client(int irq, struct pt_regs *regs)
> void *info = call_data->info;
>
> clear_softint(1 << irq);
> +
> + irq_enter();
> +
> + if (!call_data->wait) {
> + /* let initiator proceed after getting data */
> + atomic_inc(&call_data->finished);
> + }
> +
> + func(info);
> +
> + irq_exit();
> +
> if (call_data->wait) {
> /* let initiator proceed only after completion */
> - func(info);
> - atomic_inc(&call_data->finished);
> - } else {
> - /* let initiator proceed after getting data */
> atomic_inc(&call_data->finished);
> - func(info);
> }
> }
FWIW I do think this is a valid 'bug' fix in that the called user func()
can now see it's running from interrupt context. in_interrupt() would have
reported false due to the missing irq_enter()/irq_exit() - not sure if
any smp_call_function() callback relies on it though.
* Re: [git pull] scheduler/misc fixes
2008-04-25 8:24 ` Peter Zijlstra
@ 2008-04-25 8:30 ` David Miller
0 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2008-04-25 8:30 UTC (permalink / raw)
To: a.p.zijlstra; +Cc: mingo, torvalds, linux-kernel, akpm, viro, alan, tglx
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri, 25 Apr 2008 10:24:18 +0200
> FWIW I do think this is a valid 'bug' fix in that the called user func()
> can now see it's running from interrupt context. in_interrupt() would have
> reported false due to the missing irq_enter()/irq_exit() - not sure if
> any smp_call_function() relies on it though.
Understood.
I thought a bit and I remember that I did notice long ago that x86 did
these irq_enter() calls in SMP call function processing, but it seemed
like pure overhead at the time and wasn't actually needed.
Anyways, Peter can you give your state machine thing another shot?
Maybe it will help us see what is wrong either in the sparc64 code
or in the NOHZ/sched bits.
* Re: [git pull] scheduler/misc fixes
2008-04-25 7:48 ` Peter Zijlstra
2008-04-25 7:57 ` David Miller
@ 2008-04-25 10:19 ` Peter Zijlstra
2008-04-25 10:51 ` Ingo Molnar
2008-04-27 18:05 ` Thomas Gleixner
1 sibling, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2008-04-25 10:19 UTC (permalink / raw)
To: David Miller
Cc: mingo, torvalds, linux-kernel, akpm, viro, alan, Thomas Gleixner
On Fri, 2008-04-25 at 09:48 +0200, Peter Zijlstra wrote:
> a) 15934a37324f32e0fda633dc7984a671ea81cd75 does indeed fix a bug in
> rq->clock; without that patch it compresses nohz time to a single
> jiffy, so cpu_clock(), which (without the above hack) is based on
> rq->clock, will be short on nohz time. This can 'hide' the clock jump
> and thus hide false positives.
>
>
> b) there is commit:
>
> ---
> commit d3938204468dccae16be0099a2abf53db4ed0505
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date: Wed Nov 28 15:52:56 2007 +0100
> softlockup: fix false positives on CONFIG_NOHZ
>
> David Miller reported soft lockup false-positives that trigger
> on NOHZ due to CPUs idling for more than 10 seconds.
>
> The solution is to touch the softlockup watchdog when we return from
> idle. (by definition we are not 'locked up' when we were idle)
>
> http://bugzilla.kernel.org/show_bug.cgi?id=9409
>
> Reported-by: David Miller <davem@davemloft.net>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 27a2338..cb89fa8 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -133,6 +133,8 @@ void tick_nohz_update_jiffies(void)
> if (!ts->tick_stopped)
> return;
>
> + touch_softlockup_watchdog();
> +
> cpu_clear(cpu, nohz_cpu_mask);
> now = ktime_get();
>
> ---
>
> which should 'fix' this problem.
>
>
> c) there are 'IPI' handlers on SPARC64 that look like they can wake
> the CPU from idle sleep but do not appear to call irq_enter() which
> has the above patch's touch_softlockup_watchdog() in its callchain.
>
> tl0_irq1: TRAP_IRQ(smp_call_function_client, 1)
> tl0_irq2: TRAP_IRQ(smp_receive_signal_client, 2)
> tl0_irq3: TRAP_IRQ(smp_penguin_jailcell, 3)
> tl0_irq4: TRAP_IRQ(smp_new_mmu_context_version_client, 4)
OK, so David came up with the idea that it might be the reschedule IPI
(smp_receive_signal_client) that did the wakeup, resulting in the
following scenario:
<idle > 60s >
<resched-IPI> -> nohz_restart() -> restart timer
-> schedule()
<run stuff>
<timer> -> softlockup_tick() -> BUG!
doing irq_enter/exit() for smp_receive_signal_client() did indeed fix
the whole issue. (x86 also has this bug - it's just darn hard to generate
60s+ nohz periods due to the shitty clocks/timers)
So per b) any nohz wake needs to be done with an interrupt _and_ all
such interrupts must pass through irq_enter().
As far as I can tell this is not necessarily true (or desired from a
performance POV) for all platforms; imagine the core2 monitor/mwait idle
that wakes up because of a memory write. This doesn't require an
interrupt at all to wake up.
So, are we going to require all waking interrupts (IPIs and regular) to
do the irq_enter/exit() dance and add the perhaps unneeded overhead to
these paths and require the non-interrupt driven wake-ups like
monitor/mwait to do the touch_softlockup_watchdog() themselves?
Or,
Is Ingo's initial patch to make nohz_restart() also touch the softlockup
watchdog the best fix (now that we understand what happens)?
* Re: [git pull] scheduler/misc fixes
2008-04-25 10:19 ` Peter Zijlstra
@ 2008-04-25 10:51 ` Ingo Molnar
2008-04-25 20:07 ` David Miller
2008-04-27 18:05 ` Thomas Gleixner
1 sibling, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2008-04-25 10:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: David Miller, torvalds, linux-kernel, akpm, viro, alan,
Thomas Gleixner
* Peter Zijlstra <peterz@infradead.org> wrote:
> OK, so David came up with the idea that it might be the reschedule IPI
> (smp_receive_signal_client) that did the wakeup resulting in the
> following scenario:
>
> <idle > 60s >
> <resched-IPI> -> nohz_restart() -> restart timer
> -> schedule()
>
> <run stuff>
> <timer> -> softlockup_tick() -> BUG!
>
> doing irq_enter/exit() for smp_receive_signal_client() did indeed fix
> the whole issue. (x86 also has this bug - it's just darn hard to generate
> 60s+ nohz periods due to the shitty clocks/timers)
>
> So per b) any nohz wake needs to be done with an interrupt _and_ all
> such interrupts must pass through irq_enter().
>
> As far as I can tell this is not necessarily true (or desired from a
> performance POV) for all platforms, imagine the core2 monitor/mwait
> idle that wakes up because of a memory write. This doesn't require an
> interrupt at all to wake up.
>
> So, are we going to require all waking interrupts (IPIs and regular)
> to do the irq_enter/exit() dance and add the perhaps unneeded overhead
> to these paths and require the non-interrupt driven wake-ups like
> monitor/mwait to do the touch_softlockup_watchdog() themselves?
>
> Or,
>
> Is Ingo's initial patch to make nohz_restart() also touch the
> softlockup watchdog the best fix (now that we understand what
> happens)?
thanks Peter for the detailed analysis.
I think we should not require all IRQ contexts to do irq_enter/exit -
not because of the overhead (these are mainly really small costs), but
mostly because as this example has shown, nothing besides this rather
elusive softlockup false-positive enforces their correctness. So we'll
always lag behind significantly and won't ever achieve a truly correct
solution.
a much stronger approach would be to require all exit-idle paths to
touch the softlockup watchdog (we are evidently not locked up there),
and require the timer IRQ to touch it as well if it hits an idle
context. We restart the scheduler tick in the exit-idle path anyway,
otherwise the scheduler breaks visibly - so that's the right place to
put the touch_softlockup_watchdog() call.
I.e. i'd pretty much concur with you that the best solution is the very
first patch - which is also in the pull to Linus - we just now
understand exactly why ;-) David, do you agree?
Ingo
* Re: [git pull] scheduler/misc fixes
2008-04-25 10:51 ` Ingo Molnar
@ 2008-04-25 20:07 ` David Miller
0 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2008-04-25 20:07 UTC (permalink / raw)
To: mingo; +Cc: peterz, torvalds, linux-kernel, akpm, viro, alan, tglx
From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 25 Apr 2008 12:51:49 +0200
> I.e. i'd pretty much concur with you that the best solution is the very
> first patch - which is also in the pull to Linus - we just now
> understand exactly why ;-) David, do you agree?
I just want my machines working, so I wrapped all sparc64 IPI
calls in irq_{enter,exit}() and pushed that to Linus.
* Re: [git pull] scheduler/misc fixes
2008-04-25 10:19 ` Peter Zijlstra
2008-04-25 10:51 ` Ingo Molnar
@ 2008-04-27 18:05 ` Thomas Gleixner
1 sibling, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2008-04-27 18:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: David Miller, mingo, torvalds, linux-kernel, akpm, viro, alan
On Fri, 25 Apr 2008, Peter Zijlstra wrote:
> So per b) any nohz wake needs to be done with an interrupt _and_ all
> such interrupts must pass through irq_enter().
>
> As far as I can tell this is not necessarily true (or desired from a
> performance POV) for all platforms, imagine the core2 monitor/mwait idle
> that wakes up because of a memory write. This doesn't require an
> interrupt at all to wake up.
>
> So, are we going to require all waking interrupts (IPIs and regular) to
> do the irq_enter/exit() dance and add the perhaps unneeded overhead to
> these paths and require the non-interrupt driven wake-ups like
> monitor/mwait to do the touch_softlockup_watchdog() themselves?
regular irqs need to go through irq_enter()/exit() in any case.
IPIs are borderline. As long as none of the IPI code relies on
jiffies, touches timers ... we don't need to. But we are on the safe
side if we do.
> Or,
>
> Is Ingo's initial patch to make nohz_restart() also touch the softlockup
> watchdog the best fix (now that we understand what happens)?
Yes, we need it for mwait based wakeups which don't go through
interrupts at all.
Thanks,
tglx