From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
To: <mingo@redhat.com>, <peterz@infradead.org>
Cc: <kernellwp@gmail.com>, <riel@redhat.com>, <tkhai@yandex.ru>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2] sched/fair: Care divide error in update_task_scan_period()
Date: Mon, 20 Oct 2014 16:47:31 +0900
Message-ID: <5444BE13.8050603@jp.fujitsu.com>
In-Reply-To: <543F945F.4020303@jp.fujitsu.com>

Could you review this patch?

(2014/10/16 18:48), Yasuaki Ishimatsu wrote:
> While offlining a node by hot-removing memory, the following divide
> error occurs:
> 
>    divide error: 0000 [#1] SMP
>    [...]
>    Call Trace:
>     [...] handle_mm_fault
>     [...] ? try_to_wake_up
>     [...] ? wake_up_state
>     [...] __do_page_fault
>     [...] ? do_futex
>     [...] ? put_prev_entity
>     [...] ? __switch_to
>     [...] do_page_fault
>     [...] page_fault
>    [...]
>    RIP  [<ffffffff810a7081>] task_numa_fault
>     RSP <ffff88084eb2bcb0>
> 
> The issue occurs as follows:
>    1. When a page fault occurs and the page is allocated from node 1,
>       task_struct->numa_faults_buffer_memory[] of node 1 is
>       incremented and p->numa_faults_locality[] is also incremented
>       as follows:
> 
>       o numa_faults_buffer_memory[]       o numa_faults_locality[]
>                NR_NUMA_HINT_FAULT_TYPES
>               |      0     |     1     |
>       ----------------------------------  ----------------------
>        node 0 |      0     |     0     |   remote |      0     |
>        node 1 |      0     |     1     |   local  |      1     |
>       ----------------------------------  ----------------------
> 
>    2. Node 1 is offlined by hot-removing memory.
> 
>    3. When the next page fault occurs, task_numa_placement() calculates
>       fault_types[] from p->numa_faults_buffer_memory[] of all online
>       nodes. Because node 1 was offlined in step 2, only the
>       p->numa_faults_buffer_memory[] entries of node 0 are used, so
>       both elements of fault_types[] are set to 0.
> 
>    4. These zero values of fault_types[] are passed to
>       update_task_scan_period().
> 
>    5. numa_faults_locality[1] is set to 1, so the following division
>       is calculated:
> 
>          static void update_task_scan_period(struct task_struct *p,
>                                  unsigned long shared, unsigned long private){
>          ...
>                  ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>          }
> 
>    6. But both private and shared are 0, so the divide error
>       occurs here.
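
Steps 1 to 3 above can be reproduced in a small userspace sketch. This is
not the kernel code: the two-node layout, the online[] array, and the
helper name sum_online_faults() are assumptions made only to illustrate
how skipping an offlined node leaves every fault_types[] entry at 0.

```c
#include <stdbool.h>

#define NR_NODES       2
#define NR_FAULT_TYPES 2   /* stands in for NR_NUMA_HINT_FAULT_TYPES */

/*
 * Hypothetical helper: sum the per-type fault counters over online nodes
 * only, as step 3 describes. Counters of offline nodes are ignored.
 */
static void sum_online_faults(unsigned long faults[NR_NODES][NR_FAULT_TYPES],
                              const bool online[NR_NODES],
                              unsigned long fault_types[NR_FAULT_TYPES])
{
    for (int t = 0; t < NR_FAULT_TYPES; t++)
        fault_types[t] = 0;

    for (int nid = 0; nid < NR_NODES; nid++) {
        if (!online[nid])
            continue;   /* an offlined node contributes nothing */
        for (int t = 0; t < NR_FAULT_TYPES; t++)
            fault_types[t] += faults[nid][t];
    }
}
```

With the counters from the table in step 1 (a single fault recorded on
node 1) and node 1 marked offline, both sums come out as 0, which is the
input that reaches the division in step 5.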
> 
> The divide error is a rare case because the trigger is node offlining.
> With this patch, when both private and shared are 0, the denominator
> is set to 1 to avoid the divide error.
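
The guard described above can be tried outside the kernel with a minimal
sketch. The function name private_ratio() is hypothetical, the value 10
for NUMA_PERIOD_SLOTS is assumed for illustration, and DIV_ROUND_UP is
redefined locally with the usual round-up form.

```c
/* Assumed value for illustration; the kernel defines its own constant. */
#define NUMA_PERIOD_SLOTS 10

/* Local stand-in for the kernel macro: integer division rounding up. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper mirroring the patched expression: when no faults
 * were counted at all, clamp the denominator to 1 so the division is
 * always defined.
 */
static unsigned long private_ratio(unsigned long private_faults,
                                   unsigned long shared_faults)
{
    unsigned long total_faults = private_faults + shared_faults;

    if (total_faults == 0)
        total_faults = 1;   /* avoids the divide error */

    return DIV_ROUND_UP(private_faults * NUMA_PERIOD_SLOTS, total_faults);
}
```

With both inputs at 0 the numerator is also 0, so the clamped division
yields a ratio of 0 instead of trapping.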
> 
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> CC: Wanpeng Li <kernellwp@gmail.com>
> CC: Rik van Riel <riel@redhat.com>
> CC: Peter Zijlstra <peterz@infradead.org>
> ---
>   kernel/sched/fair.c | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfa3c86..580fc74 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1466,6 +1466,7 @@ static void update_task_scan_period(struct task_struct *p,
> 
>   	unsigned long remote = p->numa_faults_locality[0];
>   	unsigned long local = p->numa_faults_locality[1];
> +	unsigned long total_faults = shared + private;
> 
>   	/*
>   	 * If there were no record hinting faults then either the task is
> @@ -1496,6 +1497,14 @@ static void update_task_scan_period(struct task_struct *p,
>   			slot = 1;
>   		diff = slot * period_slot;
>   	} else {
> +		/*
> +		 * This is a rare case. total_faults might become 0 after
> +		 * offlining node. In this case, total_faults is set to 1
> +		 * for avoiding divide error.
> +		 */
> +		if (unlikely(total_faults == 0))
> +			total_faults = 1;
> +
>   		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
> 
>   		/*
> @@ -1506,7 +1515,7 @@ static void update_task_scan_period(struct task_struct *p,
>   		 * scanning faster if shared accesses dominate as it may
>   		 * simply bounce migrations uselessly
>   		 */
> -		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
> +		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (total_faults));
>   		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>   	}
> 


Thread overview: 4+ messages
2014-10-16  9:48 [PATCH v2] sched/fair: Care divide error in update_task_scan_period() Yasuaki Ishimatsu
2014-10-20  7:47 ` Yasuaki Ishimatsu [this message]
2014-10-21  9:21 ` Peter Zijlstra
2014-10-22  5:39   ` Yasuaki Ishimatsu