[PATCH v2] sched/fair: Care divide error in update_task_scan_period()

public inbox for linux-kernel@vger.kernel.org

* [PATCH v2] sched/fair: Care divide error in update_task_scan_period()
@ 2014-10-16  9:48 Yasuaki Ishimatsu
  2014-10-20  7:47 ` Yasuaki Ishimatsu
  2014-10-21  9:21 ` Peter Zijlstra
  0 siblings, 2 replies; 4+ messages in thread
From: Yasuaki Ishimatsu @ 2014-10-16  9:48 UTC (permalink / raw)
  To: mingo, peterz; +Cc: kernellwp, riel, tkhai, linux-kernel

While offlining a node by hot-removing memory, the following divide error
occurs:

  divide error: 0000 [#1] SMP
  [...]
  Call Trace:
   [...] handle_mm_fault
   [...] ? try_to_wake_up
   [...] ? wake_up_state
   [...] __do_page_fault
   [...] ? do_futex
   [...] ? put_prev_entity
   [...] ? __switch_to
   [...] do_page_fault
   [...] page_fault
  [...]
  RIP  [<ffffffff810a7081>] task_numa_fault
   RSP <ffff88084eb2bcb0>

The issue occurs as follows:
  1. When a page fault occurs and the page is allocated from node 1,
     task_struct->numa_faults_buffer_memory[] of node 1 is
     incremented and p->numa_faults_locality[] is also incremented
     as follows:

     o numa_faults_buffer_memory[]       o numa_faults_locality[]
              NR_NUMA_HINT_FAULT_TYPES
             |      0     |     1     |
     ----------------------------------  ----------------------
      node 0 |      0     |     0     |   remote |      0     |
      node 1 |      0     |     1     |   local  |      1     |
     ----------------------------------  ----------------------

  2. Node 1 is offlined by hot-removing memory.

  3. On the next page fault, task_numa_placement() calculates
     fault_types[] from p->numa_faults_buffer_memory[] of all online
     nodes. Because node 1 was offlined in step 2, only the entries
     of node 0 are used, so both elements of fault_types[] are 0.

  4. These zero values of fault_types[] are passed to
     update_task_scan_period().

  5. numa_faults_locality[1] is still 1, so the following division is
     evaluated:

        static void update_task_scan_period(struct task_struct *p,
                                unsigned long shared, unsigned long private){
        ...
                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
        }

  6. But both private and shared are 0, so a divide error occurs
     here.
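
Steps 1 to 3 can be illustrated with a small userspace sketch. It is
not the kernel code: NR_NODES, NR_FAULT_TYPES, the online[] array and
sum_online_faults() are hypothetical names chosen for illustration,
and the real task_numa_placement() does considerably more. It only
shows how faults recorded against a node that later goes offline drop
out of the per-type totals.

```c
#define NR_NODES       2	/* hypothetical: two NUMA nodes */
#define NR_FAULT_TYPES 2	/* stand-in for NR_NUMA_HINT_FAULT_TYPES */

/*
 * Hypothetical helper: sum the buffered fault counts per fault type,
 * visiting only the nodes that are marked online.
 */
static void sum_online_faults(unsigned long faults[NR_NODES][NR_FAULT_TYPES],
			      const int online[NR_NODES],
			      unsigned long fault_types[NR_FAULT_TYPES])
{
	int nid, type;

	for (type = 0; type < NR_FAULT_TYPES; type++)
		fault_types[type] = 0;

	for (nid = 0; nid < NR_NODES; nid++) {
		if (!online[nid])
			continue;	/* an offlined node contributes nothing */
		for (type = 0; type < NR_FAULT_TYPES; type++)
			fault_types[type] += faults[nid][type];
	}
}
```

With the table from step 1 as input and node 1 marked offline, both
totals come out as 0 even though one fault was recorded.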

The divide error is rare because it is triggered only by node offlining.
With this patch, when both private and shared are 0, the denominator
is set to 1 to avoid the divide error.
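
The guard can be tried outside the kernel with a minimal sketch. The
value 10 for NUMA_PERIOD_SLOTS is an assumption made for this example,
DIV_ROUND_UP() is re-defined locally with the usual round-up idiom,
and guarded_ratio() is a hypothetical helper, not a function in
kernel/sched/fair.c.

```c
/* Assumed value for this sketch; the kernel defines its own constant. */
#define NUMA_PERIOD_SLOTS 10UL

/* Local copy of the usual round-up division idiom. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper: compute the private-fault ratio, substituting 1
 * for the denominator when no faults were counted at all.
 */
static unsigned long guarded_ratio(unsigned long shared,
				   unsigned long private_faults)
{
	unsigned long total_faults = shared + private_faults;

	if (total_faults == 0)
		total_faults = 1;	/* avoid dividing by zero */

	return DIV_ROUND_UP(private_faults * NUMA_PERIOD_SLOTS, total_faults);
}
```

In this sketch guarded_ratio(0, 0) returns 0 instead of trapping, and
any non-zero total gives the same result as the unguarded division.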

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
CC: Wanpeng Li <kernellwp@gmail.com>
CC: Rik van Riel <riel@redhat.com>
CC: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfa3c86..580fc74 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1466,6 +1466,7 @@ static void update_task_scan_period(struct task_struct *p,

 	unsigned long remote = p->numa_faults_locality[0];
 	unsigned long local = p->numa_faults_locality[1];
+	unsigned long total_faults = shared + private;

 	/*
 	 * If there were no record hinting faults then either the task is
@@ -1496,6 +1497,14 @@ static void update_task_scan_period(struct task_struct *p,
 			slot = 1;
 		diff = slot * period_slot;
 	} else {
+		/*
+		 * This is a rare case. total_faults might become 0 after
+		 * offlining node. In this case, total_faults is set to 1
+		 * for avoiding divide error.
+		 */
+		if (unlikely(total_faults == 0))
+			total_faults = 1;
+
 		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;

 		/*
@@ -1506,7 +1515,7 @@ static void update_task_scan_period(struct task_struct *p,
 		 * scanning faster if shared accesses dominate as it may
 		 * simply bounce migrations uselessly
 		 */
-		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
+		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (total_faults));
 		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
 	}

-- 
1.8.3.1




* Re: [PATCH v2] sched/fair: Care divide error in update_task_scan_period()
  2014-10-16  9:48 [PATCH v2] sched/fair: Care divide error in update_task_scan_period() Yasuaki Ishimatsu
@ 2014-10-20  7:47 ` Yasuaki Ishimatsu
  2014-10-21  9:21 ` Peter Zijlstra
  1 sibling, 0 replies; 4+ messages in thread
From: Yasuaki Ishimatsu @ 2014-10-20  7:47 UTC (permalink / raw)
  To: mingo, peterz; +Cc: kernellwp, riel, tkhai, linux-kernel

Could you review this patch?





* Re: [PATCH v2] sched/fair: Care divide error in update_task_scan_period()
  2014-10-16  9:48 [PATCH v2] sched/fair: Care divide error in update_task_scan_period() Yasuaki Ishimatsu
  2014-10-20  7:47 ` Yasuaki Ishimatsu
@ 2014-10-21  9:21 ` Peter Zijlstra
  2014-10-22  5:39   ` Yasuaki Ishimatsu
  1 sibling, 1 reply; 4+ messages in thread
From: Peter Zijlstra @ 2014-10-21  9:21 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: mingo, kernellwp, riel, tkhai, linux-kernel

On Thu, Oct 16, 2014 at 06:48:15PM +0900, Yasuaki Ishimatsu wrote:
> +++ b/kernel/sched/fair.c
> @@ -1466,6 +1466,7 @@ static void update_task_scan_period(struct task_struct *p,
> 
>  	unsigned long remote = p->numa_faults_locality[0];
>  	unsigned long local = p->numa_faults_locality[1];
> +	unsigned long total_faults = shared + private;
> 
>  	/*
>  	 * If there were no record hinting faults then either the task is
> @@ -1496,6 +1497,14 @@ static void update_task_scan_period(struct task_struct *p,
>  			slot = 1;
>  		diff = slot * period_slot;
>  	} else {
> +		/*
> +		 * This is a rare case. total_faults might become 0 after
> +		 * offlining node. In this case, total_faults is set to 1
> +		 * for avoiding divide error.
> +		 */
> +		if (unlikely(total_faults == 0))
> +			total_faults = 1;
> +
>  		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
> 
>  		/*
> @@ -1506,7 +1515,7 @@ static void update_task_scan_period(struct task_struct *p,
>  		 * scanning faster if shared accesses dominate as it may
>  		 * simply bounce migrations uselessly
>  		 */
> -		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
> +		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (total_faults));
>  		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;

So what was wrong with the 'normal' unconditional +1 approach? Also
you've got superfluous parentheses.
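
One reading of the unconditional +1 suggestion is to add 1 to the
denominator in every case, which needs no branch. The sketch below
assumes that reading; plus_one_ratio() is a hypothetical helper, the
value 10 for NUMA_PERIOD_SLOTS is an assumption, and DIV_ROUND_UP() is
re-defined locally so the fragment stands alone.

```c
/* Assumed value for this sketch; the kernel defines its own constant. */
#define NUMA_PERIOD_SLOTS 10UL

/* Local copy of the usual round-up division idiom. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper: the denominator is always at least 1, so the
 * division cannot trap even when both counters are 0.
 */
static unsigned long plus_one_ratio(unsigned long shared,
				    unsigned long private_faults)
{
	return DIV_ROUND_UP(private_faults * NUMA_PERIOD_SLOTS,
			    private_faults + shared + 1);
}
```

Under these assumptions the result shifts slightly for small counts:
with 1 private and 0 shared faults the original expression yields 10,
while this variant yields 5. With 5 private and 5 shared faults both
yield 5.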


* Re: [PATCH v2] sched/fair: Care divide error in update_task_scan_period()
  2014-10-21  9:21 ` Peter Zijlstra
@ 2014-10-22  5:39   ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 4+ messages in thread
From: Yasuaki Ishimatsu @ 2014-10-22  5:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, kernellwp, riel, tkhai, linux-kernel

(2014/10/21 18:21), Peter Zijlstra wrote:

> So what was wrong with the 'normal' unconditional +1 approach? Also
> you've got superfluous parentheses.
>

I did not want to change the behavior of update_task_scan_period() when
(private + shared) is not 0. But I understand your point and will update
the patch.

Thanks,
Yasuaki Ishimatsu


