[PATCH v3] sched/fair: Care divide error in update_task_scan_period()

linux-kernel.vger.kernel.org archive mirror

* [PATCH v3] sched/fair: Care divide error in update_task_scan_period()
@ 2014-10-22  7:04 Yasuaki Ishimatsu
  2014-10-22  7:09 ` Wanpeng Li
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Yasuaki Ishimatsu @ 2014-10-22  7:04 UTC (permalink / raw)
  To: mingo, peterz; +Cc: kernellwp, riel, tkhai, linux-kernel

While offlining a node by hot-removing memory, the following divide
error occurs:

  divide error: 0000 [#1] SMP
  [...]
  Call Trace:
   [...] handle_mm_fault
   [...] ? try_to_wake_up
   [...] ? wake_up_state
   [...] __do_page_fault
   [...] ? do_futex
   [...] ? put_prev_entity
   [...] ? __switch_to
   [...] do_page_fault
   [...] page_fault
  [...]
  RIP  [<ffffffff810a7081>] task_numa_fault
   RSP <ffff88084eb2bcb0>

The issue occurs as follows:
  1. When a page fault occurs and the page is allocated from node 1,
     task_struct->numa_faults_buffer_memory[] of node 1 is
     incremented and p->numa_faults_locality[] is also incremented
     as follows:

     o numa_faults_buffer_memory[]       o numa_faults_locality[]
              NR_NUMA_HINT_FAULT_TYPES
             |      0     |     1     |
     ----------------------------------  ----------------------
      node 0 |      0     |     0     |   remote |      0     |
      node 1 |      0     |     1     |   local  |      1     |
     ----------------------------------  ----------------------

  2. Node 1 is offlined by hot-removing memory.

  3. When the next page fault occurs, task_numa_placement() calculates
     fault_types[] from p->numa_faults_buffer_memory[] of all online
     nodes. Because node 1 was offlined in step 2, only the counters
     of node 0 are used, so both elements of fault_types[] are 0.

  4. The values (0) of fault_types[] are passed to
     update_task_scan_period().

  5. numa_faults_locality[1] is set to 1, so the following division is
     calculated:

        static void update_task_scan_period(struct task_struct *p,
                                unsigned long shared, unsigned long private){
        ...
                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
        }

  6. But both private and shared are 0, so the divisor is 0 and the
     divide error occurs here.

The divide error is rare because it is triggered only by offlining a
node. This patch always adds 1 to the denominator to avoid it.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
v2:
 - Simply increment a denominator

 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b069bf..f3b492d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1520,7 +1520,7 @@ static void update_task_scan_period(struct task_struct *p,
 		 * scanning faster if shared accesses dominate as it may
 		 * simply bounce migrations uselessly
 		 */
-		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
+		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));
 		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
 	}

-- 
1.8.3.1




* Re: [PATCH v3] sched/fair: Care divide error in update_task_scan_period()
  2014-10-22  7:04 [PATCH v3] sched/fair: Care divide error in update_task_scan_period() Yasuaki Ishimatsu
@ 2014-10-22  7:09 ` Wanpeng Li
  2014-10-22  7:41   ` Yasuaki Ishimatsu
  2014-10-27  5:46 ` Yasuaki Ishimatsu
  2014-10-28 11:02 ` [tip:sched/core] " tip-bot for Yasuaki Ishimatsu
  2 siblings, 1 reply; 5+ messages in thread
From: Wanpeng Li @ 2014-10-22  7:09 UTC (permalink / raw)
  To: Yasuaki Ishimatsu, mingo, peterz; +Cc: riel, tkhai, linux-kernel

10/22/14, 3:04 PM, Yasuaki Ishimatsu:
> While offling node by hot removing memory, the following divide error
> occurs:
>
>   divide error: 0000 [#1] SMP
>   [...]
>   Call Trace:
>    [...] handle_mm_fault
>    [...] ? try_to_wake_up
>    [...] ? wake_up_state
>    [...] __do_page_fault
>    [...] ? do_futex
>    [...] ? put_prev_entity
>    [...] ? __switch_to
>    [...] do_page_fault
>    [...] page_fault
>   [...]
>   RIP  [<ffffffff810a7081>] task_numa_fault
>    RSP <ffff88084eb2bcb0>
>
> The issue occurs as follows:
>   1. When page fault occurs and page is allocated from node 1,
>      task_struct->numa_faults_buffer_memory[] of node 1 is
>      incremented and p->numa_faults_locality[] is also incremented
>      as follows:
>
>      o numa_faults_buffer_memory[]       o numa_faults_locality[]
>               NR_NUMA_HINT_FAULT_TYPES
>              |      0     |     1     |
>      ----------------------------------  ----------------------
>       node 0 |      0     |     0     |   remote |      0     |
>       node 1 |      0     |     1     |   locale |      1     |
>      ----------------------------------  ----------------------
>
>   2. node 1 is offlined by hot removing memory.
>
>   3. When page fault occurs, fault_types[] is calculated by using
>      p->numa_faults_buffer_memory[] of all online nodes in
>      task_numa_placement(). But node 1 was offline by step 2. So
>      the fault_types[] is calculated by using only
>      p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
>      are set to 0.
>
>   4. The values(0) of fault_types[] pass to update_task_scan_period().
>
>   5. numa_faults_locality[1] is set to 1. So the following division is
>      calculated.
>
>         static void update_task_scan_period(struct task_struct *p,
>                                 unsigned long shared, unsigned long private){
>         ...
>                 ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>         }
>
>   6. But both of private and shared are set to 0. So divide error
>      occurs here.
>
> The divide error is rare case because the trigger is node offline.
> This patch always increments denominator for avoiding divide error.
>
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>

> ---
> v2:
>  - Simply increment a denominator
>
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0b069bf..f3b492d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1520,7 +1520,7 @@ static void update_task_scan_period(struct task_struct *p,
>  		 * scanning faster if shared accesses dominate as it may
>  		 * simply bounce migrations uselessly
>  		 */
> -		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
> +		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));
>  		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>  	}
>



* Re: [PATCH v3] sched/fair: Care divide error in update_task_scan_period()
  2014-10-22  7:09 ` Wanpeng Li
@ 2014-10-22  7:41   ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 5+ messages in thread
From: Yasuaki Ishimatsu @ 2014-10-22  7:41 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: mingo, peterz, riel, tkhai, linux-kernel

(2014/10/22 16:09), Wanpeng Li wrote:
> 10/22/14, 3:04 PM, Yasuaki Ishimatsu:
>> While offling node by hot removing memory, the following divide error
>> occurs:
>>
>>    divide error: 0000 [#1] SMP
>>    [...]
>>    Call Trace:
>>     [...] handle_mm_fault
>>     [...] ? try_to_wake_up
>>     [...] ? wake_up_state
>>     [...] __do_page_fault
>>     [...] ? do_futex
>>     [...] ? put_prev_entity
>>     [...] ? __switch_to
>>     [...] do_page_fault
>>     [...] page_fault
>>    [...]
>>    RIP  [<ffffffff810a7081>] task_numa_fault
>>     RSP <ffff88084eb2bcb0>
>>
>> The issue occurs as follows:
>>    1. When page fault occurs and page is allocated from node 1,
>>       task_struct->numa_faults_buffer_memory[] of node 1 is
>>       incremented and p->numa_faults_locality[] is also incremented
>>       as follows:
>>
>>       o numa_faults_buffer_memory[]       o numa_faults_locality[]
>>                NR_NUMA_HINT_FAULT_TYPES
>>               |      0     |     1     |
>>       ----------------------------------  ----------------------
>>        node 0 |      0     |     0     |   remote |      0     |
>>        node 1 |      0     |     1     |   locale |      1     |
>>       ----------------------------------  ----------------------
>>
>>    2. node 1 is offlined by hot removing memory.
>>
>>    3. When page fault occurs, fault_types[] is calculated by using
>>       p->numa_faults_buffer_memory[] of all online nodes in
>>       task_numa_placement(). But node 1 was offline by step 2. So
>>       the fault_types[] is calculated by using only
>>       p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
>>       are set to 0.
>>
>>    4. The values(0) of fault_types[] pass to update_task_scan_period().
>>
>>    5. numa_faults_locality[1] is set to 1. So the following division is
>>       calculated.
>>
>>          static void update_task_scan_period(struct task_struct *p,
>>                                  unsigned long shared, unsigned long private){
>>          ...
>>                  ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>>          }
>>
>>    6. But both of private and shared are set to 0. So divide error
>>       occurs here.
>>
>> The divide error is rare case because the trigger is node offline.
>> This patch always increments denominator for avoiding divide error.
>>
>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>

Thank you for your review.

Thanks,
Yasuaki Ishimatsu

> 
>> ---
>> v2:
>>   - Simply increment a denominator
>>
>>   kernel/sched/fair.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0b069bf..f3b492d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1520,7 +1520,7 @@ static void update_task_scan_period(struct task_struct *p,
>>   		 * scanning faster if shared accesses dominate as it may
>>   		 * simply bounce migrations uselessly
>>   		 */
>> -		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>> +		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));
>>   		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>>   	}
>>
> 




* Re: [PATCH v3] sched/fair: Care divide error in update_task_scan_period()
  2014-10-22  7:04 [PATCH v3] sched/fair: Care divide error in update_task_scan_period() Yasuaki Ishimatsu
  2014-10-22  7:09 ` Wanpeng Li
@ 2014-10-27  5:46 ` Yasuaki Ishimatsu
  2014-10-28 11:02 ` [tip:sched/core] " tip-bot for Yasuaki Ishimatsu
  2 siblings, 0 replies; 5+ messages in thread
From: Yasuaki Ishimatsu @ 2014-10-27  5:46 UTC (permalink / raw)
  To: mingo, peterz; +Cc: kernellwp, riel, tkhai, linux-kernel

Could you review my patch?

Thanks,
Yasuaki Ishimatsu


* [tip:sched/core] sched/fair: Care divide error in update_task_scan_period()
  2014-10-22  7:04 [PATCH v3] sched/fair: Care divide error in update_task_scan_period() Yasuaki Ishimatsu
  2014-10-22  7:09 ` Wanpeng Li
  2014-10-27  5:46 ` Yasuaki Ishimatsu
@ 2014-10-28 11:02 ` tip-bot for Yasuaki Ishimatsu
  2 siblings, 0 replies; 5+ messages in thread
From: tip-bot for Yasuaki Ishimatsu @ 2014-10-28 11:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, hpa, torvalds, tglx, linux-kernel, isimatu.yasuaki, mingo

Commit-ID:  2847c90e1b3ae95379af24894fc4f98e7f2fd705
Gitweb:     http://git.kernel.org/tip/2847c90e1b3ae95379af24894fc4f98e7f2fd705
Author:     Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
AuthorDate: Wed, 22 Oct 2014 16:04:35 +0900
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 28 Oct 2014 10:46:03 +0100

sched/fair: Care divide error in update_task_scan_period()

While offlining a node by hot-removing memory, the following divide
error occurs:

  divide error: 0000 [#1] SMP
  [...]
  Call Trace:
   [...] handle_mm_fault
   [...] ? try_to_wake_up
   [...] ? wake_up_state
   [...] __do_page_fault
   [...] ? do_futex
   [...] ? put_prev_entity
   [...] ? __switch_to
   [...] do_page_fault
   [...] page_fault
  [...]
  RIP  [<ffffffff810a7081>] task_numa_fault
   RSP <ffff88084eb2bcb0>

The issue occurs as follows:
  1. When a page fault occurs and the page is allocated from node 1,
     task_struct->numa_faults_buffer_memory[] of node 1 is
     incremented and p->numa_faults_locality[] is also incremented
     as follows:

     o numa_faults_buffer_memory[]       o numa_faults_locality[]
              NR_NUMA_HINT_FAULT_TYPES
             |      0     |     1     |
     ----------------------------------  ----------------------
      node 0 |      0     |     0     |   remote |      0     |
      node 1 |      0     |     1     |   local  |      1     |
     ----------------------------------  ----------------------

  2. Node 1 is offlined by hot-removing memory.

  3. When the next page fault occurs, task_numa_placement() calculates
     fault_types[] from p->numa_faults_buffer_memory[] of all online
     nodes. Because node 1 was offlined in step 2, only the counters
     of node 0 are used, so both elements of fault_types[] are 0.

  4. The values (0) of fault_types[] are passed to
     update_task_scan_period().

  5. numa_faults_locality[1] is set to 1, so the following division is
     calculated:

        static void update_task_scan_period(struct task_struct *p,
                                unsigned long shared, unsigned long private){
        ...
                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
        }

  6. But both private and shared are 0, so the divisor is 0 and the
     divide error occurs here.

The divide error is rare because it is triggered only by offlining a
node. This patch always adds 1 to the denominator to avoid it.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbc0b82..e9abd4e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1530,7 +1530,7 @@ static void update_task_scan_period(struct task_struct *p,
 		 * scanning faster if shared accesses dominate as it may
 		 * simply bounce migrations uselessly
 		 */
-		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
+		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));
 		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
 	}
 
