Date: Tue, 30 May 2017 17:43:26 +0200
From: Andrea Arcangeli
To: Michal Hocko
Cc: Mike Rapoport, Vlastimil Babka, "Kirill A. Shutemov", Andrew Morton,
	Arnd Bergmann, "Kirill A. Shutemov", Pavel Emelyanov, linux-mm, lkml,
	Linux API
Subject: Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
Message-ID: <20170530154326.GB8412@redhat.com>
References: <20170524075043.GB3063@rapoport-lnx>
	<20170524103947.GC3063@rapoport-lnx>
	<20170524111800.GD14733@dhcp22.suse.cz>
	<20170524142735.GF3063@rapoport-lnx>
	<20170530074408.GA7969@dhcp22.suse.cz>
	<20170530101921.GA25738@rapoport-lnx>
	<20170530103930.GB7969@dhcp22.suse.cz>
	<20170530140456.GA8412@redhat.com>
	<20170530143941.GK7969@dhcp22.suse.cz>
In-Reply-To: <20170530143941.GK7969@dhcp22.suse.cz>

On Tue, May 30, 2017 at 04:39:41PM +0200, Michal Hocko wrote:
> I sysctl for the mapcount can be increased, right?
> I also assume that those vmas will get merged after the post copy is
> done.

Assuming you enlarge the sysctl to the worst possible case, with a
64bit address space you can have billions of VMAs if you're migrating
4T of RAM and you're unlucky and the address space gets fragmented.
The unswappable kernel memory overhead would be relatively large (i.e.
a dozen gigabytes of RAM in the vm_area_struct slab), and each find_vma
operation would need to walk ~40 steps across that large vma rbtree.
There's a reason the sysctl exists. Not to mention that all those
unnecessary vma mangling operations would be serialized by the
mmap_sem held for writing.

Not creating a ton of vmas, enabling vma-less pte mangling within a
single large vma, and only taking mmap_sem for reading during all the
pte mangling is one of the primary design motivations for userfaultfd.

> I understand that part but it sounds awfully one purpose thing to me.
> Are we going to add other MADVISE_RESET_$FOO to clear other flags just
> because we can race in this specific use case?

Those already exist, see for example MADV_NORMAL, which clears
VM_RAND_READ and VM_SEQ_READ after a call to MADV_SEQUENTIAL or
MADV_RANDOM. Or MADV_DOFORK after MADV_DONTFORK. MADV_DODUMP after
MADV_DONTDUMP. Etc.

> But we already have MADV_HUGEPAGE, MADV_NOHUGEPAGE and prctl to
> enable/disable thp. Doesn't that sound little bit too much for a single
> feature to you?

MADV_NOHUGEPAGE doesn't mean clearing the flag set with MADV_HUGEPAGE.
MADV_NOHUGEPAGE disables THP on the region if the global sysfs
"enabled" tune is set to "always". MADV_HUGEPAGE enables THP if the
global "enabled" sysfs tune is set to "madvise". Both MADV_NOHUGEPAGE
and MADV_HUGEPAGE are needed to leverage the three-way setting
("never", "madvise", "always") of the global tune. The "madvise"
setting exists for when you want to save RAM and don't care much about
performance, while still allowing apps like QEMU, where no memory is
lost by enabling THP, to use THP.
There's no way to clear either of those two flags and bring back the default behavior of the global sysfs tune, so it's not redundant at the very least.
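
To make the three-way interaction concrete, here is a rough sketch in C of how the global setting and the per-region hint combine. It is a simplified model for illustration only, not the kernel code: the enum names and the two helpers (apply_advice, thp_allowed) are made up for this example, and it assumes that the two hints simply overwrite each other and that "never" wins over any hint.

```c
#include <stdbool.h>

/* Hypothetical model, not kernel code: all names below are invented. */

/* Global sysfs "enabled" setting. */
enum thp_global { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/* Per-region state: no hint given, or one of the two madvise hints. */
enum thp_vma { VMA_DEFAULT, VMA_HUGEPAGE, VMA_NOHUGEPAGE };

/* The two hints userland can currently pass. */
enum thp_advice { ADV_HUGEPAGE, ADV_NOHUGEPAGE };

/*
 * Per-region state after a hint. Each hint replaces the other one, and
 * no hint leads back to VMA_DEFAULT, whatever the previous state was.
 */
enum thp_vma apply_advice(enum thp_vma cur, enum thp_advice adv)
{
	(void)cur; /* the previous state does not matter in this model */
	return adv == ADV_HUGEPAGE ? VMA_HUGEPAGE : VMA_NOHUGEPAGE;
}

/* Whether THP may be used for a region under a given global setting. */
bool thp_allowed(enum thp_global g, enum thp_vma v)
{
	if (v == VMA_NOHUGEPAGE)
		return false;              /* explicit opt-out always wins */
	switch (g) {
	case THP_ALWAYS:
		return true;               /* default and opt-in both get THP */
	case THP_MADVISE:
		return v == VMA_HUGEPAGE;  /* only explicit opt-in gets THP */
	default:
		return false;              /* "never": hints don't matter */
	}
}
```

In this model a region that never received a hint follows the global setting (THP on with "always", off with "madvise"), while a region that received either hint stays pinned to it: once apply_advice() has run, nothing brings the region back to VMA_DEFAULT.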