Message-ID: <55F2F9EB.4050106@redhat.com>
Date: Fri, 11 Sep 2015 11:57:31 -0400
From: Rik van Riel
To: Mel Gorman
CC: linux-kernel@vger.kernel.org, peterz@infradead.org, mingo@kernel.org, Andrea Arcangeli, Jan Stancek
Subject: Re: [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work
References: <20150911090027.4a7987bd@annuminas.surriel.com> <20150911150544.GL25655@suse.de>
In-Reply-To: <20150911150544.GL25655@suse.de>

On 09/11/2015 11:05 AM, Mel Gorman wrote:
> On Fri, Sep 11, 2015 at 09:00:27AM -0400, Rik van Riel wrote:
>> Currently task_numa_work scans up to numa_balancing_scan_size_mb worth
>> of memory per invocation, but only counts memory areas that have at
>> least one PTE that is still present and not marked for numa hint faulting.
>>
>> It will skip over arbitrarily large amounts of memory that are either
>> unused, full of swap ptes, or full of PTEs that were already marked
>> for NUMA hint faults but have not been faulted on yet.
>>
>
> This was deliberate and intended to cover a case whereby a process sparsely
> using the address space would quickly skip over the sparse portions and
> reach the active portions. Obviously you've found that this is not always
> a great idea.

Skipping over non-present pages is fine, since the scan
rate is keyed off the RSS.

However, skipping over pages that are already marked
PROT_NONE / PTE_NUMA results in unmapping pages at a much
accelerated rate (sometimes using >90% of the CPU of the
task), because the pages that are already PROT_NONE / NUMA
_are_ counted as part of the RSS.

>> @@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
>>  			start = max(start, vma->vm_start);
>>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>>  			end = min(end, vma->vm_end);
>> -			nr_pte_updates += change_prot_numa(vma, start, end);
>> +			nr_pte_updates = change_prot_numa(vma, start, end);
>>
>
> Are you *sure* about this particular change?
>
> The intent is that sparse space be skipped until the first updated PTE
> is found and then scan sysctl_numa_balancing_scan_size pages after that.
> With this change, if we find a single PTE in the middle of a sparse space
> then we stop updating pages in the nr_pte_updates check below. You get
> protected from a lot of scanning by the virtpages check but it does not
> seem this fix is necessary. It has an odd side-effect whereby we possibly
> scan more with this patch in some cases.

True, it is possible that this patch would lead to more scanning
than before, if a process has present PTEs interleaved with areas
that are either sparsely populated, or already marked PROT_NONE.

However, was your intention to not quickly skip over empty
areas that come right after one single present PTE, but only
over empty areas at the beginning of a scan area?

If so, I don't understand the logic behind that, and would
like to know more :)

>>  			/*
>> -			 * Scan sysctl_numa_balancing_scan_size but ensure that
>> -			 * at least one PTE is updated so that unused virtual
>> -			 * address space is quickly skipped.
>> +			 * Try to scan sysctl_numa_balancing_size worth of
>> +			 * hpages that have at least one present PTE that
>> +			 * is not already pte-numa. If the VMA contains
>> +			 * areas that are unused or already full of prot_numa
>> +			 * PTEs, scan up to virtpages, to skip through those
>> +			 * areas faster.
>>  			 */
>>  			if (nr_pte_updates)
>>  				pages -= (end - start) >> PAGE_SHIFT;
>> +			virtpages -= (end - start) >> PAGE_SHIFT;
>>
>
> It's a pity there will potentially be a lot of useless dead scanning on
> those processes but caching start addresses is both outside the scope of
> this patch and has its own problems.

The problem has been observed when processes already have a
lot of pages marked PROT_NONE by change_prot_numa(), and
change_prot_numa() returns zero because no PTEs were
changed.

In that case, the amount of useless dead scanning should be
a whole lot less with this patch than without.

I do not quite understand how this patch makes it worse,
though.

-- 
All rights reversed
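
The two accounting schemes discussed above can be compared with a small
userspace model. This is only a rough sketch, not kernel code: the
function model_scan(), the struct scan_stats, the STEP_PAGES constant
(assuming 4KB pages and 2MB huge pages) and the updates[] array standing
in for the return value of change_prot_numa() are all made up for
illustration. It also assumes the loop exits once either budget is used
up, and any virtpages value passed in is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration: 4KB pages, 2MB huge pages => 512 pages per step. */
#define STEP_PAGES 512L

struct scan_stats {
	long steps;       /* loop iterations, one per hpage-sized range */
	long pte_updates; /* total PTEs the stand-in reported as updated */
};

/*
 * Hypothetical userspace model of the scan loop accounting.
 * updates[i] stands in for what change_prot_numa() would return for the
 * i-th hpage-sized range: 0 if the range is unused or already PROT_NONE.
 * patched == 0 models the old accounting (nr_pte_updates accumulates),
 * patched != 0 models the proposed one (per-range result plus a
 * virtpages budget). virtpages is ignored in the unpatched case.
 */
static struct scan_stats model_scan(const long *updates, size_t n,
				    long pages, long virtpages, int patched)
{
	struct scan_stats st = { 0, 0 };
	long nr_pte_updates = 0;
	size_t i;

	assert(pages > 0 && virtpages > 0);

	for (i = 0; i < n; i++) {
		if (patched)
			nr_pte_updates = updates[i];  /* result for this range only */
		else
			nr_pte_updates += updates[i]; /* stays nonzero once set */

		st.steps++;
		st.pte_updates += updates[i];

		/* Only ranges counted as "updated" consume the pages budget. */
		if (nr_pte_updates)
			pages -= STEP_PAGES;
		/* The patched loop charges every range against virtpages. */
		if (patched)
			virtpages -= STEP_PAGES;

		if (pages <= 0 || (patched && virtpages <= 0))
			break;
	}
	return st;
}
```

With a pages budget of 1024, a virtpages budget of 8192 and 1000 ranges
that all return zero (everything already PROT_NONE), the old accounting
in this model walks all 1000 ranges, while the patched one stops after
16. With a single updated PTE in the first range followed by empty
ranges, the old accounting stops after 2 ranges and the patched one
after 16, which corresponds to the "scan more in some cases" side effect
raised in the review.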