public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
To: <mingo@redhat.com>, <peterz@infradead.org>
Cc: <linux-kernel@vger.kernel.org>, <riel@redhat.com>, <tkhai@yandex.ru>
Subject: [PATCH] sched/fair: Care divide error in update_task_scan_period()
Date: Wed, 8 Oct 2014 15:43:11 +0900	[thread overview]
Message-ID: <5434DCFF.1040208@jp.fujitsu.com> (raw)

While offling node by hot removing memory, the following divide error
occurs:

  divide error: 0000 [#1] SMP
  [...]
  Call Trace:
   [...] handle_mm_fault
   [...] ? try_to_wake_up
   [...] ? wake_up_state
   [...] __do_page_fault
   [...] ? do_futex
   [...] ? put_prev_entity
   [...] ? __switch_to
   [...] do_page_fault
   [...] page_fault
  [...]
  RIP  [<ffffffff810a7081>] task_numa_fault
   RSP <ffff88084eb2bcb0>

The issue occurs as follows:
  1. When page fault occurs and page is allocated from node 1,
     task_struct->numa_faults_buffer_memory[] of node 1 is
     incremented and p->numa_faults_locality[] is also incremented
     as follows:

     o numa_faults_buffer_memory[]       o numa_faults_locality[]
              NR_NUMA_HINT_FAULT_TYPES
             |      0     |     1     |
     ----------------------------------  ----------------------
      node 0 |      0     |     0     |   remote |      0     |
      node 1 |      0     |     1     |   locale |      1     |
     ----------------------------------  ----------------------

  2. node 1 is offlined by hot removing memory.

  3. When page fault occurs, fault_types[] is calculated by using
     p->numa_faults_buffer_memory[] of all online nodes in
     task_numa_placement(). But node 1 was offline by step 2. So
     the fault_types[] is calculated by using only
     p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
     are set to 0.

  4. The values(0) of fault_types[] pass to update_task_scan_period().

  5. numa_faults_locality[1] is set to 1. So the following division is
     calculated.

        static void update_task_scan_period(struct task_struct *p,
                                unsigned long shared, unsigned long private){
        ...
                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
        }

  6. But both of private and shared are set to 0. So divide error
     occurs here.

  The divide error is rare case because the trigger is node offline.
  By this patch, when both of private and shared are set to 0, diff
  is just set to 0, not calculating the division.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 kernel/sched/fair.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfa3c86..fb7dc3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1496,18 +1496,26 @@ static void update_task_scan_period(struct task_struct *p,
 			slot = 1;
 		diff = slot * period_slot;
 	} else {
-		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
+		if (unlikely((private + shared) == 0))
+			/*
+			 * This is a rare case. The trigger is node offline.
+			 */
+			diff = 0;
+		else {
+			diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;

-		/*
-		 * Scale scan rate increases based on sharing. There is an
-		 * inverse relationship between the degree of sharing and
-		 * the adjustment made to the scanning period. Broadly
-		 * speaking the intent is that there is little point
-		 * scanning faster if shared accesses dominate as it may
-		 * simply bounce migrations uselessly
-		 */
-		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
-		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
+			/*
+			 * Scale scan rate increases based on sharing. There is
+			 * an inverse relationship between the degree of sharing
+			 * and the adjustment made to the scanning period.
+			 * Broadly speaking the intent is that there is little
+			 * point scanning faster if shared accesses dominate as
+			 * it may simply bounce migrations uselessly
+			 */
+			ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
+							(private + shared));
+			diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
+		}
 	}

 	p->numa_scan_period = clamp(p->numa_scan_period + diff,
-- 
1.8.3.1



             reply	other threads:[~2014-10-08  6:44 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-08  6:43 Yasuaki Ishimatsu [this message]
2014-10-08  8:31 ` [PATCH] sched/fair: Care divide error in update_task_scan_period() Peter Zijlstra
2014-10-08 11:51 ` Wanpeng Li
2014-10-09  5:34   ` Yasuaki Ishimatsu
2014-10-08 16:42 ` Rik van Riel
2014-10-08 16:54   ` Peter Zijlstra
2014-10-09  5:19     ` Yasuaki Ishimatsu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5434DCFF.1040208@jp.fujitsu.com \
    --to=isimatu.yasuaki@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=tkhai@yandex.ru \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox