From: riel@redhat.com
To: linux-kernel@vger.kernel.org
Cc: peterz@infradead.org, mingo@kernel.org, jstancek@redhat.com,
mgorman@suse.de
Subject: [PATCH 2/2] sched,numa: cap pte scanning overhead to 3% of run time
Date: Thu, 5 Nov 2015 15:56:23 -0500
Message-ID: <1446756983-28173-3-git-send-email-riel@redhat.com>
In-Reply-To: <1446756983-28173-1-git-send-email-riel@redhat.com>
From: Rik van Riel <riel@redhat.com>

There is a fundamental mismatch between the runtime-based NUMA scanning
at the task level and the wall-clock-based NUMA scanning at the mm level.
On a severely overloaded system with very large processes, this mismatch
can cause the system to spend all of its time in change_prot_numa().

This can happen if the task spends at least two ticks in change_prot_numa()
and only gets two ticks of CPU time in the real time between two scan
intervals of the mm.

This patch ensures that a task never spends more than 3% of its run
time scanning PTEs. It does that by ensuring that, between
task_numa_work() runs, the task spends at least 32x as much time on
other things as it did on task_numa_work().

This is done stochastically: if a timer tick happens, or the task
gets rescheduled during task_numa_work(), we delay a future run of
task_numa_work() until the task has spent at least 32x as much CPU
time doing something else as it spent inside task_numa_work().

The longer task_numa_work() takes, the more likely it is that this
happens. If task_numa_work() takes very little time, chances are low
that the throttling code will do anything, but in that case the
scanning overhead is too small to matter.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
---
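
The throttle described in the changelog above can be modelled in
userspace. The following is a minimal sketch, not kernel code:
struct model_task, model_numa_work() and model_overhead_bp() are
hypothetical names, and the tick-side check is an assumption about how
task_tick_numa() compares sum_exec_runtime against node_stamp plus the
scan period. The model always charges the scan to the task's runtime,
which corresponds to the overloaded case where a tick or reschedule
lands inside task_numa_work().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the two task fields the patch uses. */
struct model_task {
	uint64_t sum_exec_runtime;	/* CPU time consumed so far, in ns */
	uint64_t node_stamp;		/* runtime stamp gating the next scan, in ns */
};

/*
 * Models the tail of the patched task_numa_work(): the scan costs scan_ns
 * of runtime, and node_stamp is pushed forward by mult times that cost.
 * The patch uses mult == 32; mult == 0 models the unpatched behaviour.
 */
static void model_numa_work(struct model_task *t, uint64_t scan_ns,
			    uint64_t mult)
{
	uint64_t runtime = t->sum_exec_runtime;

	t->sum_exec_runtime += scan_ns;	/* time spent scanning PTEs */

	if (t->sum_exec_runtime != runtime) {
		uint64_t diff = t->sum_exec_runtime - runtime;

		t->node_stamp += mult * diff;
	}
}

/*
 * Runs the model until total_ns of runtime has been consumed and returns
 * the share of runtime spent scanning, in basis points (1/10000).
 */
static uint64_t model_overhead_bp(uint64_t period_ns, uint64_t scan_ns,
				  uint64_t tick_ns, uint64_t total_ns,
				  uint64_t mult)
{
	struct model_task t = { 0, 0 };
	uint64_t scanned = 0;

	/* Guard against a loop that never advances or a division by zero. */
	assert(period_ns > 0 && tick_ns > 0 && total_ns > 0);

	while (t.sum_exec_runtime < total_ns) {
		/* Assumed shape of the per-tick check in task_tick_numa(). */
		if (t.sum_exec_runtime > t.node_stamp + period_ns) {
			t.node_stamp += period_ns;
			model_numa_work(&t, scan_ns, mult);
			scanned += scan_ns;
		} else {
			t.sum_exec_runtime += tick_ns;	/* one tick of other work */
		}
	}
	return scanned * 10000 / t.sum_exec_runtime;
}
```

In this model, with a 10 ms scan period, 5 ms scans and 1 ms ticks over
10 s of runtime, a multiplier of 32 keeps the scanning share at or below
312 basis points (1/32), while a multiplier of 0 lets scanning consume
close to half of the runtime.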
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f04fda8f669c..b0924377ab0d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
 	unsigned long migrate, next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
+	u64 runtime = p->se.sum_exec_runtime;
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
@@ -2277,6 +2278,17 @@ void task_numa_work(struct callback_head *work)
 	else
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * Make sure tasks use at least 32x as much time to run other code
+	 * than they used here, to limit NUMA PTE scanning overhead to 3% max.
+	 * Usually update_task_scan_period slows down scanning enough; on an
+	 * overloaded system we need to limit overhead on a per task basis.
+	 */
+	if (unlikely(p->se.sum_exec_runtime != runtime)) {
+		u64 diff = p->se.sum_exec_runtime - runtime;
+		p->node_stamp += 32 * diff;
+	}
 }
 
 /*
--
2.1.0
Thread overview: 5+ messages
2015-11-05 20:56 [PATCH 0/2] sched,numa: cap pte scanning overhead to 3% of run time riel
2015-11-05 20:56 ` [PATCH 1/2] sched,numa: fix math underflow in task_tick_numa riel
2015-11-10 6:40 ` [tip:sched/urgent] sched/numa: Fix math underflow in task_tick_numa() tip-bot for Rik van Riel
2015-11-05 20:56 ` riel [this message]
2015-11-23 16:19 ` [tip:sched/core] sched/numa: Cap PTE scanning overhead to 3% of run time tip-bot for Rik van Riel