[PATCH 2/2] sched,numa: cap pte scanning overhead to 3% of run time

From: riel@redhat.com
To: linux-kernel@vger.kernel.org
Cc: peterz@infradead.org, mingo@kernel.org, jstancek@redhat.com,
	mgorman@suse.de
Subject: [PATCH 2/2] sched,numa: cap pte scanning overhead to 3% of run time
Date: Thu,  5 Nov 2015 15:56:23 -0500
Message-ID: <1446756983-28173-3-git-send-email-riel@redhat.com>
In-Reply-To: <1446756983-28173-1-git-send-email-riel@redhat.com>

From: Rik van Riel <riel@redhat.com>

There is a fundamental mismatch between the runtime-based NUMA scanning
at the task level and the wall-clock-based NUMA scanning at the mm level.
On a severely overloaded system with very large processes, this mismatch
can cause the system to spend all of its time in change_prot_numa().

This can happen if the task spends at least two ticks in change_prot_numa(),
and only gets two ticks of CPU time in the real time between two scan
intervals of the mm.

This patch ensures that a task never spends more than about 3% of its
run time scanning PTEs. It does that by ensuring that, in between
task_numa_work runs, the task spends at least 32x as much time on
other things as it did on task_numa_work (1/33 of run time, or ~3%).

This is done stochastically: if a timer tick happens, or the task
gets rescheduled, during task_numa_work, we delay the next run of
task_numa_work until the task has spent at least 32x as much CPU time
doing something else as it spent inside task_numa_work.
The longer task_numa_work takes, the more likely it is that this happens.

If task_numa_work takes very little time, chances are low that this
code will do anything, but in that case the overhead is small enough
that it does not matter.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f04fda8f669c..b0924377ab0d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	u64 runtime = p->se.sum_exec_runtime;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
@@ -2277,6 +2278,17 @@ void task_numa_work(struct callback_head *work)
 	else
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * Make sure tasks use at least 32x as much time to run other code
+	 * than they used here, to limit NUMA PTE scanning overhead to 3% max.
+	 * Usually update_task_scan_period slows down scanning enough; on an
+	 * overloaded system we need to limit overhead on a per task basis.
+	 */
+	if (unlikely(p->se.sum_exec_runtime != runtime)) {
+		u64 diff = p->se.sum_exec_runtime - runtime;
+		p->node_stamp += 32 * diff;
+	}
 }

 /*
-- 
2.1.0

Thread overview: 5+ messages
2015-11-05 20:56 [PATCH 0/2] sched,numa: cap pte scanning overhead to 3% of run time riel
2015-11-05 20:56 ` [PATCH 1/2] sched,numa: fix math underflow in task_tick_numa riel
2015-11-10  6:40   ` [tip:sched/urgent] sched/numa: Fix math underflow in task_tick_numa() tip-bot for Rik van Riel
2015-11-05 20:56 ` riel [this message]
2015-11-23 16:19   ` [tip:sched/core] sched/numa: Cap PTE scanning overhead to 3% of run time tip-bot for Rik van Riel