From: Mel Gorman <mgorman@suse.de>
To: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
mingo@kernel.org, Andrea Arcangeli <aarcange@redhat.com>,
Jan Stancek <jstancek@redhat.com>
Subject: Re: [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work
Date: Fri, 11 Sep 2015 16:05:44 +0100
Message-ID: <20150911150544.GL25655@suse.de>
In-Reply-To: <20150911090027.4a7987bd@annuminas.surriel.com>
On Fri, Sep 11, 2015 at 09:00:27AM -0400, Rik van Riel wrote:
> Currently task_numa_work scans up to numa_balancing_scan_size_mb worth
> of memory per invocation, but only counts memory areas that have at
> least one PTE that is still present and not marked for numa hint faulting.
>
> It will skip over arbitrarily large amounts of memory that are either
> unused, full of swap ptes, or full of PTEs that were already marked
> for NUMA hint faults but have not been faulted on yet.
>
This was deliberate and intended to cover a case whereby a process sparsely
using the address space would quickly skip over the sparse portions and
reach the active portions. Obviously you've found that this is not always
a great idea.
> This can cause excessive amounts of CPU use, due to there being
> essentially no upper limit on the scan rate of very large processes
> that are not yet in a phase where they are actively accessing old
> memory pages (eg. they are still initializing their data).
>
> Avoid that problem by placing an upper limit on the amount of virtual
> memory that task_numa_work scans in each invocation. This can be a
> higher limit than "pages", to ensure the task still skips over unused
> areas fairly quickly.
>
> While we are here, also fix the "nr_pte_updates" logic, so it only
> counts page ranges with ptes in them.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Jan Stancek <jstancek@redhat.com>
> ---
> kernel/sched/fair.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e2e3483b1ec..ff51b559ccaf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2157,7 +2157,7 @@ void task_numa_work(struct callback_head *work)
> struct vm_area_struct *vma;
> unsigned long start, end;
> unsigned long nr_pte_updates = 0;
> - long pages;
> + long pages, virtpages;
>
> WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
>
> @@ -2203,9 +2203,11 @@ void task_numa_work(struct callback_head *work)
> start = mm->numa_scan_offset;
> pages = sysctl_numa_balancing_scan_size;
> pages <<= 20 - PAGE_SHIFT; /* MB in pages */
> + virtpages = pages * 8; /* Scan up to this much virtual space */
> if (!pages)
> return;
>
> +
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, start);
> if (!vma) {
> @@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
> start = max(start, vma->vm_start);
> end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
> end = min(end, vma->vm_end);
> - nr_pte_updates += change_prot_numa(vma, start, end);
> + nr_pte_updates = change_prot_numa(vma, start, end);
>
Are you *sure* about this particular change?

The intent is that sparse space be skipped until the first updated PTE
is found and then scan sysctl_numa_balancing_scan_size pages after that.
With this change, if we find a single PTE in the middle of a sparse space
then we stop decrementing pages in the nr_pte_updates check below for the
empty ranges that follow. You get protected from a lot of scanning by the
virtpages check but it does not seem that this fix is necessary. It has an
odd side-effect whereby we possibly scan more with this patch in some cases.
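
To make the side-effect concrete, here is a rough user-space sketch, not
kernel code. It assumes fixed-size chunks (the real loop sizes each chunk
from the remaining pages budget and aligns it to HPAGE_SIZE), and
scan_chunks(), CHUNK_PAGES and the updates[] array are hypothetical names
used only for illustration. updates[i] stands in for the return value of
change_prot_numa() on chunk i.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed chunk size: 512 pages (2MB worth of 4K pages). */
#define CHUNK_PAGES 512L

/*
 * Walk the chunks until either budget is exhausted and return how many
 * chunks were visited.
 *
 * accumulate == true  models "nr_pte_updates += change_prot_numa(...)"
 * accumulate == false models "nr_pte_updates  = change_prot_numa(...)"
 */
size_t scan_chunks(const unsigned long *updates, size_t nr_chunks,
                   long pages, long virtpages, bool accumulate)
{
	unsigned long nr_pte_updates = 0;
	size_t i;

	assert(updates != NULL);

	for (i = 0; i < nr_chunks; i++) {
		if (accumulate)
			nr_pte_updates += updates[i];
		else
			nr_pte_updates = updates[i];

		/* Only charge the "pages" budget when the check passes. */
		if (nr_pte_updates)
			pages -= CHUNK_PAGES;
		/* The virtual budget is charged for every chunk. */
		virtpages -= CHUNK_PAGES;

		if (pages <= 0 || virtpages <= 0)
			return i + 1;
	}
	return nr_chunks;
}
```

With 64 chunks where only chunk 2 has an updatable PTE, a pages budget of
4 * CHUNK_PAGES and a virtpages budget of 32 * CHUNK_PAGES, the
accumulating variant stops after 6 chunks because every chunk after the
first hit is charged to pages. The per-chunk variant charges only chunk 2
and keeps going until virtpages runs out after 32 chunks.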
> /*
> - * Scan sysctl_numa_balancing_scan_size but ensure that
> - * at least one PTE is updated so that unused virtual
> - * address space is quickly skipped.
> + * Try to scan sysctl_numa_balancing_size worth of
> + * hpages that have at least one present PTE that
> + * is not already pte-numa. If the VMA contains
> + * areas that are unused or already full of prot_numa
> + * PTEs, scan up to virtpages, to skip through those
> + * areas faster.
> */
> if (nr_pte_updates)
> pages -= (end - start) >> PAGE_SHIFT;
> + virtpages -= (end - start) >> PAGE_SHIFT;
>
It's a pity there will potentially be a lot of useless dead scanning on
those processes but caching start addresses is both outside the scope of
this patch and has its own problems.
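
For a sense of scale, the budgets in the hunk above can be worked through
in user space. This is only a sketch under stated assumptions: 4K pages
(SKETCH_PAGE_SHIFT is a stand-in for PAGE_SHIFT), the factor of 8 taken
from the patch, and a scan size of 256MB picked purely as an example
value. The helper names are hypothetical.

```c
#include <assert.h>

/* Assumption: 4K pages, so the shift is 12. */
#define SKETCH_PAGE_SHIFT 12

/* Mirrors "pages <<= 20 - PAGE_SHIFT": converts MB to pages. */
long scan_budget_pages(long scan_size_mb)
{
	assert(scan_size_mb >= 0);
	return scan_size_mb << (20 - SKETCH_PAGE_SHIFT);
}

/* Mirrors "virtpages = pages * 8" from the patch. */
long virt_budget_pages(long scan_size_mb)
{
	return scan_budget_pages(scan_size_mb) * 8;
}

/* Converts a page count back to MB for readability. */
long pages_to_mb(long pages)
{
	return pages >> (20 - SKETCH_PAGE_SHIFT);
}
```

Under those assumptions a 256MB scan size gives 65536 pages, and the
virtual budget is 524288 pages, or 2048MB of address space per invocation
in the worst case. The real loop rounds each range up to HPAGE_SIZE, so
the bound is approximate.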
> start = end;
> - if (pages <= 0)
> + if (pages <= 0 || virtpages <= 0)
> goto out;
>
> cond_resched();
--
Mel Gorman
SUSE Labs