* sched: race between deactivate and switch sched_info accounting?
From: Paul Turner @ 2009-10-10 2:40 UTC (permalink / raw)
To: mingo, peterz, linux-kernel
Hi,
While chasing down a performance issue we found inconsistencies in the
accounting of run_delay. Tracking this down, I think there exists a race
condition in which a task can 're-arrive' without updating its queuing
accounting. This can lead to incorrect run_delay and rq_cpu_time
accounting.
Consider a flow such as:
1. task p blocks, enters schedule()
2. we deactivate p
3. there's nothing else so we load balance
4. during load balancing we drop rq->lock (within double_lock_balance()) and
a remote cpu succeeds in re-activating p
5. we then re-pick next==p
6. we skip the accounting update since prev==next, start running p again
(last_arrival is left polluted with our prior arrival)
<time passes>
7. enter schedule() again, switch into new task
a. if p blocks then we'll go through deactivate which then will
charge the last period as run_delay even though we were actually
running.
b. if p doesn't block then we'll leave last_queued at the stale prior
value since last_queued != 0 in sched_info_queued.
[*] in depart our delta will cover the span since our original arrival
including some (probably largely inconsequential) extra time that
wouldn't normally be charged to us.
<time passes>
8. pick p to run again
if 7b applied, then the queue delay we charge here will again include
the prior timeslice that p spent running.
One way to work around this is to charge a switch event, which accounts for
the fact that the task technically did temporarily depart and re-arrive.
A patch for this approach is below, but there are definitely other ways to
handle it (there may also be additional concerns as further accounting
continues to be added).
Thanks,
- Paul
---
It's possible for our previously de-activated task to be re-activated by a
remote cpu during lock balancing. We have to account for this manually
since prev == next, yet the task just went through dequeue accounting.
Signed-off-by: Paul Turner <pjt@google.com>
---
kernel/sched.c | 15 ++++++++++++---
1 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index ee61f45..6445d9d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5381,7 +5381,7 @@ asmlinkage void __sched schedule(void)
struct task_struct *prev, *next;
unsigned long *switch_count;
struct rq *rq;
- int cpu;
+ int cpu, deactivated_prev = 0;
need_resched:
preempt_disable();
@@ -5406,8 +5406,10 @@ need_resched_nonpreemptible:
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
if (unlikely(signal_pending_state(prev->state, prev)))
prev->state = TASK_RUNNING;
- else
+ else {
deactivate_task(rq, prev, 1);
+ deactivated_prev = 1;
+ }
switch_count = &prev->nvcsw;
}
@@ -5434,8 +5436,15 @@ need_resched_nonpreemptible:
*/
cpu = smp_processor_id();
rq = cpu_rq(cpu);
- } else
+ } else {
+ /*
+ * account for our previous task being re-activated by a
+ * remote cpu.
+ */
+ if (unlikely(deactivated_prev))
+ sched_info_switch(prev, prev);
spin_unlock_irq(&rq->lock);
+ }
post_schedule(rq);
--
1.5.4.3
* Re: sched: race between deactivate and switch sched_info accounting?
From: Peter Zijlstra @ 2009-10-10 6:37 UTC (permalink / raw)
To: Paul Turner; +Cc: mingo, linux-kernel
On Fri, 2009-10-09 at 19:40 -0700, Paul Turner wrote:
This looks very funny; I would expect whoever does activate() on
that task to do the sched_info*() muck?
The below patch looks very asymmetric in that regard.
> It's possible for our previously de-activated task to be re-activated by a
> remote cpu during lock balancing. We have to account for this manually
> since prev == next, yet the task just went through dequeue accounting.
>
> Signed-off-by: Paul Turner <pjt@google.com>
> ---
> kernel/sched.c | 15 ++++++++++++---
> 1 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index ee61f45..6445d9d 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5381,7 +5381,7 @@ asmlinkage void __sched schedule(void)
> struct task_struct *prev, *next;
> unsigned long *switch_count;
> struct rq *rq;
> - int cpu;
> + int cpu, deactivated_prev = 0;
>
> need_resched:
> preempt_disable();
> @@ -5406,8 +5406,10 @@ need_resched_nonpreemptible:
> if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
> if (unlikely(signal_pending_state(prev->state, prev)))
> prev->state = TASK_RUNNING;
> - else
> + else {
> deactivate_task(rq, prev, 1);
> + deactivated_prev = 1;
> + }
> switch_count = &prev->nvcsw;
> }
>
> @@ -5434,8 +5436,15 @@ need_resched_nonpreemptible:
> */
> cpu = smp_processor_id();
> rq = cpu_rq(cpu);
> - } else
> + } else {
> + /*
> + * account for our previous task being re-activated by a
> + * remote cpu.
> + */
> + if (unlikely(deactivated_prev))
> + sched_info_switch(prev, prev);
> spin_unlock_irq(&rq->lock);
> + }
>
> post_schedule(rq);
>
* Re: sched: race between deactivate and switch sched_info accounting?
From: Paul Turner @ 2009-10-12 21:20 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: mingo, linux-kernel
On Fri, Oct 9, 2009 at 11:37 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2009-10-09 at 19:40 -0700, Paul Turner wrote:
>
> This looks very funny; I would expect whoever does activate() on
> that task to do the sched_info*() muck?
They do their part; the problem stems from there being two halves of the
sched_info muck to deal with:
the enqueue/dequeue tracking (which is called out of the activate path), and
the arrive/depart accounting (which is hinged off the task switch in schedule()).
These steps share state in last_queued; we account the enqueue/dequeue
properly, but because prev == next we skip the second half of the
accounting and end up out of sync.
>
> The below patch looks very asymmetric in that regard.
>
The idea below is to fix things up by 'inserting' the task->idle->task
switch we'd have had if we hadn't raced with the wake-up.
This approach might look more natural if the first sched_info_switch() were
just pulled above the actual switch logic with this check; another option
might be to force a last_queued reset on depart (since it's really only the
run_delay skew that's significant).
>> It's possible for our previously de-activated task to be re-activated by a
>> remote cpu during lock balancing. We have to account for this manually
>> since prev == next, yet the task just went through dequeue accounting.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>