From: Peter Zijlstra <peterz@infradead.org>
To: Rik van Riel <riel@surriel.com>
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, mgorman@suse.de,
	jstancek@redhat.com
Subject: Re: [PATCH] sched,numa cap pte scanning overhead to 3% of run time
Date: Thu, 5 Nov 2015 16:34:02 +0100
Message-ID: <20151105153402.GR17308@twins.programming.kicks-ass.net>
In-Reply-To: <20151104132515.07e41b75@annuminas.surriel.com>

On Wed, Nov 04, 2015 at 01:25:15PM -0500, Rik van Riel wrote:
> +++ b/kernel/sched/fair.c
> @@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
>  	unsigned long migrate, next_scan, now = jiffies;
>  	struct task_struct *p = current;
>  	struct mm_struct *mm = p->mm;
> +	u64 runtime = p->se.sum_exec_runtime;
>  	struct vm_area_struct *vma;
>  	unsigned long start, end;
>  	unsigned long nr_pte_updates = 0;
> @@ -2277,6 +2278,20 @@ void task_numa_work(struct callback_head *work)
>  	else
>  		reset_ptenuma_scan(p);
>  	up_read(&mm->mmap_sem);
> +
> +	/*
> +	 * There is a fundamental mismatch between the runtime based
> +	 * NUMA scanning at the task level, and the wall clock time
> +	 * NUMA scanning at the mm level. On a severely overloaded
> +	 * system, with very large processes, this mismatch can cause
> +	 * the system to spend all of its time in change_prot_numa().
> +	 * Limit NUMA PTE scanning to 3% of the task's run time, if
> +	 * we spent so much time scanning we got rescheduled.
> +	 */
> +	if (unlikely(p->se.sum_exec_runtime != runtime)) {
> +		u64 diff = p->se.sum_exec_runtime - runtime;
> +		p->node_stamp += 32 * diff;
> +	}

I don't actually see how this does what it says it does.

>  }
>  
>  /*
> @@ -2302,7 +2317,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
>  	now = curr->se.sum_exec_runtime;
>  	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
>  
> -	if (now - curr->node_stamp > period) {
> +	if (now > curr->node_stamp + period) {
>  		if (!curr->node_stamp)
>  			curr->numa_scan_period = task_scan_min(curr);
>  		curr->node_stamp += period;

And this really should be an independent patch. Although the fix I had
in mind looked like:

	if ((s64)(now - curr->node_stamp) > period)

But I suppose this works too.
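
For anyone following along, here is a minimal userspace sketch of the three
comparisons discussed above. It is not kernel code: the helper names are made
up for illustration, and it assumes that now, node_stamp and period are all
unsigned 64-bit values (the (u64) cast in the quoted hunk suggests this for
period, but the declarations are not shown here). The point is what happens
when node_stamp ends up ahead of now, which seems possible once node_stamp is
pushed forward by 32 * diff.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Original check. If node_stamp is ahead of now, the unsigned
 * subtraction wraps to a huge value and the test fires early.
 */
static bool due_unsigned_sub(uint64_t now, uint64_t node_stamp, uint64_t period)
{
	return now - node_stamp > period;
}

/*
 * Form used in the quoted patch. There is no subtraction, so nothing
 * wraps unless node_stamp + period itself overflows 64 bits.
 */
static bool due_add(uint64_t now, uint64_t node_stamp, uint64_t period)
{
	return now > node_stamp + period;
}

/*
 * Signed-difference form. Both sides are cast here: with an unsigned
 * 64-bit period, casting only the left side would be converted back
 * to unsigned by the usual arithmetic conversions, and a negative
 * difference would again compare as a huge positive value.
 */
static bool due_signed_diff(uint64_t now, uint64_t node_stamp, uint64_t period)
{
	return (int64_t)(now - node_stamp) > (int64_t)period;
}
```

With now = 5000, node_stamp = 6000 and period = 100, the first helper reports
that a scan is due while the other two do not. When node_stamp is behind now,
all three agree.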
