Re: [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work

From: Rik van Riel <riel@redhat.com>
To: Mel Gorman <mgorman@suse.de>
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	mingo@kernel.org, Andrea Arcangeli <aarcange@redhat.com>,
	Jan Stancek <jstancek@redhat.com>
Subject: Re: [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work
Date: Fri, 11 Sep 2015 11:57:31 -0400
Message-ID: <55F2F9EB.4050106@redhat.com>
In-Reply-To: <20150911150544.GL25655@suse.de>

On 09/11/2015 11:05 AM, Mel Gorman wrote:
> On Fri, Sep 11, 2015 at 09:00:27AM -0400, Rik van Riel wrote:
>> Currently task_numa_work scans up to numa_balancing_scan_size_mb worth
>> of memory per invocation, but only counts memory areas that have at
>> least one PTE that is still present and not marked for numa hint faulting.
>>
>> It will skip over arbitrarily large amounts of memory that are either
>> unused, full of swap ptes, or full of PTEs that were already marked
>> for NUMA hint faults but have not been faulted on yet.
>>
> 
> This was deliberate and intended to cover a case whereby a process sparsely
> using the address space would quickly skip over the sparse portions and
> reach the active portions. Obviously you've found that this is not always
> a great idea.

Skipping over non-present pages is fine, since the scan
rate is keyed off the RSS.

However, skipping over pages that are already marked
PROT_NONE / PTE_NUMA results in unmapping pages at a greatly
accelerated rate (sometimes using >90% of the task's CPU
time), because pages that are already PROT_NONE / NUMA
_are_ counted as part of the RSS.

>> @@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
>>  			start = max(start, vma->vm_start);
>>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>>  			end = min(end, vma->vm_end);
>> -			nr_pte_updates += change_prot_numa(vma, start, end);
>> +			nr_pte_updates = change_prot_numa(vma, start, end);
>>  
> 
> Are you *sure* about this particular change?
> 
> The intent is that sparse space be skipped until the first updated PTE
> is found and then scan sysctl_numa_balancing_scan_size pages after that.
> With this change, if we find a single PTE in the middle of a sparse space
> then we stop updating pages in the nr_pte_updates check below. You get
> protected from a lot of scanning by the virtpages check but it does not
> seem this fix is necessary.  It has an odd side-effect whereby we possibly
> scan more with this patch in some cases.

True, it is possible that this patch would lead to more scanning
than before, if a process has present PTEs interleaved with areas
that are either sparsely populated, or already marked PROT_NONE.

However, was your intention to not quickly skip over empty areas
that come right after one single present PTE, but only over empty
areas at the beginning of a scan area?

If so, I don't understand the logic behind that, and would like
to know more :)
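
To make the difference between "+=" and "=" concrete, here is a toy
user-space model, not kernel code. It assumes the VMA is walked in
fixed-size chunks. CHUNK_PAGES, chunks_walked() and the updates[] array
are invented for illustration; updates[i] stands in for the value
change_prot_numa() would return for chunk i. The virtpages limit is
deliberately left out so that only the accumulation change is visible.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_PAGES 512L	/* hypothetical chunk size, in pages */

/*
 * Walk the chunks until the pages budget is used up and return how
 * many chunks were visited. updates[i] models the number of PTEs a
 * change_prot_numa()-like call would update in chunk i.
 */
static size_t chunks_walked(const unsigned long *updates, size_t nchunks,
			    long pages, int accumulate)
{
	unsigned long nr_pte_updates = 0;
	size_t walked = 0;

	assert(pages > 0);

	for (size_t i = 0; i < nchunks; i++) {
		if (accumulate)
			nr_pte_updates += updates[i];	/* stays non-zero after the first hit */
		else
			nr_pte_updates = updates[i];	/* reflects this chunk only */

		walked++;

		/* Only charge the budget when the counter is non-zero. */
		if (nr_pte_updates)
			pages -= CHUNK_PAGES;
		if (pages <= 0)
			break;
	}
	return walked;
}
```

With one present PTE in the first chunk followed by nine empty chunks
and a budget of 1024 pages, the accumulating variant stops after 2
chunks, while the per-call variant walks all 10.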

>>  			/*
>> -			 * Scan sysctl_numa_balancing_scan_size but ensure that
>> -			 * at least one PTE is updated so that unused virtual
>> -			 * address space is quickly skipped.
>> +			 * Try to scan sysctl_numa_balancing_size worth of
>> +			 * hpages that have at least one present PTE that
>> +			 * is not already pte-numa. If the VMA contains
>> +			 * areas that are unused or already full of prot_numa
>> +			 * PTEs, scan up to virtpages, to skip through those
>> +			 * areas faster.
>>  			 */
>>  			if (nr_pte_updates)
>>  				pages -= (end - start) >> PAGE_SHIFT;
>> +			virtpages -= (end - start) >> PAGE_SHIFT;
>>  
> 
> It's a pity there will potentially be a lot of useless dead scanning on
> those processes but caching start addresses is both outside the scope of
> this patch and has its own problems.

The problem has been observed when processes already have a lot of
pages marked PROT_NONE by change_prot_numa(), so that change_prot_numa()
returns zero because no PTEs were changed.

In that case, the amount of useless dead scanning should be a whole
lot less with this patch than without it.

I do not quite understand how this patch makes it worse, though.
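
A toy user-space model of that case, again not kernel code: every
chunk returns zero updates because its PTEs are already PROT_NONE.
scan_chunks(), CHUNK_PAGES and the limit_virt flag are invented names,
and the budget values used afterwards are arbitrary numbers picked for
the example.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_PAGES 512L	/* hypothetical chunk size, in pages */

/*
 * Return how many chunks are walked before a budget runs out.
 * updates[i] models what a change_prot_numa()-like call would return
 * for chunk i. limit_virt selects whether the virtpages budget is
 * enforced (patched behaviour) or ignored (old behaviour).
 */
static size_t scan_chunks(const unsigned long *updates, size_t nchunks,
			  long pages, long virtpages, int limit_virt)
{
	size_t walked = 0;

	assert(pages > 0 && virtpages > 0);

	for (size_t i = 0; i < nchunks; i++) {
		unsigned long nr_pte_updates = updates[i];

		walked++;

		/* Chunks with updated PTEs are charged to the pages budget. */
		if (nr_pte_updates)
			pages -= CHUNK_PAGES;
		/* Every chunk is charged to the virtual budget. */
		virtpages -= CHUNK_PAGES;

		if (pages <= 0 || (limit_virt && virtpages <= 0))
			break;
	}
	return walked;
}
```

For 100 chunks that all return zero, with pages = 1024 and
virtpages = 4096, the walk covers all 100 chunks without the virtpages
check and 8 chunks with it.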

-- 
All rights reversed

Thread overview: 5+ messages
2015-09-11 13:00 [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work Rik van Riel
2015-09-11 15:05 ` Mel Gorman
2015-09-11 15:57   ` Rik van Riel [this message]
2015-09-11 16:16     ` Mel Gorman
2015-09-18  8:48 ` [tip:sched/core] sched/numa: Limit the amount of virtual memory scanned in task_numa_work() tip-bot for Rik van Riel