From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752493AbbIKQQ2 (ORCPT );
	Fri, 11 Sep 2015 12:16:28 -0400
Received: from mx2.suse.de ([195.135.220.15]:45199 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751617AbbIKQQ0 (ORCPT );
	Fri, 11 Sep 2015 12:16:26 -0400
Date: Fri, 11 Sep 2015 17:16:23 +0100
From: Mel Gorman
To: Rik van Riel
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, mingo@kernel.org,
	Andrea Arcangeli, Jan Stancek
Subject: Re: [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work
Message-ID: <20150911161623.GM25655@suse.de>
References: <20150911090027.4a7987bd@annuminas.surriel.com>
 <20150911150544.GL25655@suse.de>
 <55F2F9EB.4050106@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <55F2F9EB.4050106@redhat.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 11, 2015 at 11:57:31AM -0400, Rik van Riel wrote:
> On 09/11/2015 11:05 AM, Mel Gorman wrote:
> > On Fri, Sep 11, 2015 at 09:00:27AM -0400, Rik van Riel wrote:
> >> Currently task_numa_work scans up to numa_balancing_scan_size_mb worth
> >> of memory per invocation, but only counts memory areas that have at
> >> least one PTE that is still present and not marked for numa hint faulting.
> >>
> >> It will skip over arbitrarily large amounts of memory that are either
> >> unused, full of swap ptes, or full of PTEs that were already marked
> >> for NUMA hint faults but have not been faulted on yet.
> >>
> >
> > This was deliberate and intended to cover a case whereby a process sparsely
> > using the address space would quickly skip over the sparse portions and
> > reach the active portions. Obviously you've found that this is not always
> > a great idea.
>
> Skipping over non-present pages is fine, since the scan
> rate is keyed off the RSS.
>
> However, skipping over pages that are already marked
> PROT_NONE / PTE_NUMA results in unmapping pages at a much
> accelerated rate (sometimes using >90% of the CPU of the
> task), because the pages that are already PROT_NONE / NUMA
> _are_ counted as part of the RSS.
>

True.

> >> @@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
> >>  			start = max(start, vma->vm_start);
> >>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
> >>  			end = min(end, vma->vm_end);
> >> -			nr_pte_updates += change_prot_numa(vma, start, end);
> >> +			nr_pte_updates = change_prot_numa(vma, start, end);
> >>
> >
> > Are you *sure* about this particular change?
> >
> > The intent is that sparse space be skipped until the first updated PTE
> > is found and then scan sysctl_numa_balancing_scan_size pages after that.
> > With this change, if we find a single PTE in the middle of a sparse space
> > then we stop updating pages in the nr_pte_updates check below. You get
> > protected from a lot of scanning by the virtpages check but it does not
> > seem this fix is necessary. It has an odd side-effect whereby we possibly
> > scan more with this patch in some cases.
>
> True, it is possible that this patch would lead to more scanning
> than before, if a process has present PTEs interleaved with areas
> that are either sparsely populated, or already marked PROT_NONE.
>
> However, was your intention to not quickly skip over empty areas
> that come right after one single present PTE, but only over empty
> areas at the beginning of a scan area?
>

The intent was to skip over inactive areas which potentially are marked
PROT_NONE but not being addressed. Just because it was the intent does
not mean it was the best idea though. I can easily see how the accelerated
scan rate would occur and why it needs to be mitigated.
I just wanted to be 100% sure I understand what you were thinking and what
problem you encountered.

Acked-by: Mel Gorman

Thanks.

-- 
Mel Gorman
SUSE Labs
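
For readers following the thread, the two budgets under discussion can be
illustrated with a small userspace model. This is only a sketch under stated
assumptions, not the kernel code: simulate_scan(), struct scan_result,
CHUNK_PAGES and the updatable[] array are hypothetical names invented for the
illustration, and the real loop calls change_prot_numa() on a VMA range
instead of reading a counter from an array. The "accumulate" flag models the
difference between "nr_pte_updates +=" and "nr_pte_updates =" in the quoted
hunk, and "virtpages" models a cap on the virtual address space walked.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical step size in pages; stands in for one aligned scan step. */
#define CHUNK_PAGES 512L

struct scan_result {
	size_t chunks_walked;	/* chunks visited before a budget ran out */
	long pte_updates;	/* total PTEs "marked" during the walk */
};

/*
 * updatable[i]: number of PTEs in chunk i that could still be marked
 *               (0 for sparse or already-marked chunks).
 * pages:        budget that is charged only when updates were seen.
 * virtpages:    budget charged for every chunk walked; pass LONG_MAX
 *               to model a walk with no virtual-memory cap.
 * accumulate:   nonzero models "nr_pte_updates += ...",
 *               zero models "nr_pte_updates = ...".
 */
struct scan_result simulate_scan(const long *updatable, size_t n,
				 long pages, long virtpages, int accumulate)
{
	struct scan_result r = { 0, 0 };
	long nr_pte_updates = 0;

	assert(updatable != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (accumulate)
			nr_pte_updates += updatable[i];
		else
			nr_pte_updates = updatable[i];

		r.pte_updates += updatable[i];
		r.chunks_walked++;

		/* Charge the page budget only if updates were counted. */
		if (nr_pte_updates)
			pages -= CHUNK_PAGES;
		/* Charge the virtual budget for every chunk walked. */
		virtpages -= CHUNK_PAGES;

		if (pages <= 0 || virtpages <= 0)
			break;
	}
	return r;
}
```

In this model, with 1000 chunks that have nothing left to mark and a page
budget of 1024, the uncapped accumulating walk visits all 1000 chunks, while
a virtual cap of 8192 pages (an arbitrary value chosen for the example) stops
it after 16. If only the first chunk has a markable PTE, the accumulating
walk stops after 2 chunks, whereas the non-accumulating walk with the same
cap visits 16, which is the "scan more in some cases" side-effect raised
above.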