From: Peter Zijlstra <peterz@infradead.org>
To: Rik van Riel <riel@surriel.com>
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, mgorman@suse.de,
	jstancek@redhat.com
Subject: Re: [PATCH] sched,numa cap pte scanning overhead to 3% of run time
Date: Thu, 5 Nov 2015 17:37:12 +0100
Message-ID: <20151105163712.GE3604@twins.programming.kicks-ass.net>
In-Reply-To: <563B7C2D.90008@surriel.com>

On Thu, Nov 05, 2015 at 10:56:29AM -0500, Rik van Riel wrote:
> On 11/05/2015 10:34 AM, Peter Zijlstra wrote:
> > On Wed, Nov 04, 2015 at 01:25:15PM -0500, Rik van Riel wrote:
> >> +++ b/kernel/sched/fair.c
> >> @@ -2155,6 +2155,7 @@ void task_numa_work(struct callback_head *work)
> >>  	unsigned long migrate, next_scan, now = jiffies;
> >>  	struct task_struct *p = current;
> >>  	struct mm_struct *mm = p->mm;
> >> +	u64 runtime = p->se.sum_exec_runtime;
> >>  	struct vm_area_struct *vma;
> >>  	unsigned long start, end;
> >>  	unsigned long nr_pte_updates = 0;
> >> @@ -2277,6 +2278,20 @@ void task_numa_work(struct callback_head *work)
> >>  	else
> >>  		reset_ptenuma_scan(p);
> >>  	up_read(&mm->mmap_sem);
> >> +
> >> +	/*
> >> +	 * There is a fundamental mismatch between the runtime based
> >> +	 * NUMA scanning at the task level, and the wall clock time
> >> +	 * NUMA scanning at the mm level. On a severely overloaded
> >> +	 * system, with very large processes, this mismatch can cause
> >> +	 * the system to spend all of its time in change_prot_numa().
> >> +	 * Limit NUMA PTE scanning to 3% of the task's run time, if
> >> +	 * we spent so much time scanning we got rescheduled.
> >> +	 */
> >> +	if (unlikely(p->se.sum_exec_runtime != runtime)) {
> >> +		u64 diff = p->se.sum_exec_runtime - runtime;
> >> +		p->node_stamp += 32 * diff;
> >> +	}
> > 
> > I don't actually see how this does what it says it does
> 
> If we got rescheduled during the assigning of runtime

Or just had a tick. Even if the whole thing took a fraction of a ms but
we got unlucky and got hit by a tick the sum_exec_runtime would get
updated and not match here.

> Advancing the node_stamp by 32x the amount of time
> the task consumed between entering task_numa_work and
> this point should ensure task_numa_work does not get
> queued again until we have used 32x as much time doing
> something else.

> What am I missing?

The above issue, plus the fact that I'm really tired and didn't work out
that 1:32 ~ 3%.

So the tick scenario can cause a 32*TICK_NSEC delay even though we spent
much less than TICK_NSEC scanning, dropping the effective rate well
below the 3%.

Not sure it makes sense to do more accurate accounting, but I suppose we
should mention it somewhere.
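
For reference, the arithmetic above can be sketched in plain C. This is
only a userspace model, not kernel code: SKETCH_TICK_NSEC (assuming
HZ=1000), SKETCH_BACKOFF, backoff_ns() and scan_share_bp() are
hypothetical names, and the model ignores numa_scan_period and treats
the scan share as scan time over scan time plus the 32x backoff.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for TICK_NSEC, assuming HZ=1000 (1 ms tick). */
#define SKETCH_TICK_NSEC 1000000ULL
/* Multiplier taken from the patch: p->node_stamp += 32 * diff. */
#define SKETCH_BACKOFF 32ULL

/* Extra runtime the task has to accumulate before the next scan. */
static uint64_t backoff_ns(uint64_t diff_ns)
{
	return SKETCH_BACKOFF * diff_ns;
}

/*
 * Share of runtime spent scanning, in parts per 10000, given the time
 * actually spent scanning and the sum_exec_runtime delta that was
 * observed. The two differ when a tick updates sum_exec_runtime in the
 * middle of a short scan.
 */
static uint64_t scan_share_bp(uint64_t scan_ns, uint64_t observed_diff_ns)
{
	uint64_t total = scan_ns + backoff_ns(observed_diff_ns);

	assert(total > 0);
	return scan_ns * 10000 / total;
}
```

With these assumptions, scan_share_bp(1000000, 1000000) gives 303, i.e.
about 3%, when the observed delta matches the scan time. If a 0.1 ms
scan is charged a full tick, scan_share_bp(100000, SKETCH_TICK_NSEC)
gives 31, i.e. roughly 0.3%.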

> >> @@ -2302,7 +2317,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
> >>  	now = curr->se.sum_exec_runtime;
> >>  	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
> >>  
> >> -	if (now - curr->node_stamp > period) {
> >> +	if (now > curr->node_stamp + period) {
> >>  		if (!curr->node_stamp)
> >>  			curr->numa_scan_period = task_scan_min(curr);
> >>  		curr->node_stamp += period;

> I can resend this as a separate patch if you prefer.

Yes, it's an unrelated fix.
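
The difference between the two comparisons in the quoted
task_tick_numa() hunk only shows up when node_stamp is ahead of now.
A minimal userspace sketch, assuming u64 behaves like uint64_t;
scan_due_old() and scan_due_new() are hypothetical helper names, not
kernel functions:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Old form: with unsigned arithmetic the subtraction wraps around to a
 * huge value when node_stamp > now, so the test passes immediately.
 */
static bool scan_due_old(uint64_t now, uint64_t node_stamp, uint64_t period)
{
	return now - node_stamp > period;
}

/*
 * New form from the hunk: stays false until now has actually moved
 * past node_stamp + period. Assumes node_stamp + period itself does
 * not overflow, which holds for realistic nanosecond runtimes.
 */
static bool scan_due_new(uint64_t now, uint64_t node_stamp, uint64_t period)
{
	return now > node_stamp + period;
}
```

For now = 1000, node_stamp = 5000 and period = 100, scan_due_old()
returns true while scan_due_new() returns false. When node_stamp is
not ahead of now, both forms agree.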

Thread overview: 4+ messages
2015-11-04 18:25 [PATCH] sched,numa cap pte scanning overhead to 3% of run time Rik van Riel
2015-11-05 15:34 ` Peter Zijlstra
2015-11-05 15:56   ` Rik van Riel
2015-11-05 16:37     ` Peter Zijlstra [this message]
