Date: Thu, 1 Feb 2018 13:27:30 +0300
From: "Kirill A. Shutemov"
Subject: Re: [PATCH v2] mm: Reduce memory bloat with THP
Message-ID: <20180201102730.al4jl2raldfgoy7f@node.shutemov.name>
References: <1516318444-30868-1-git-send-email-nitingupta910@gmail.com>
 <20180119124957.GA6584@dhcp22.suse.cz>
 <59F98618-C49F-48A8-BCA1-A8F717888BAA@cs.rutgers.edu>
 <4d7ce874-9771-ad5f-c064-52a46fc37689@oracle.com>
 <20180125211303.rbfeg7ultwr6hpd3@suse.de>
In-Reply-To: <20180125211303.rbfeg7ultwr6hpd3@suse.de>
Sender: owner-linux-mm@kvack.org
To: Mel Gorman
Cc: Nitin Gupta, Zi Yan, Michal Hocko, Nitin Gupta,
 steven.sistare@oracle.com, Andrew Morton, Ingo Molnar, Nadav Amit,
 Minchan Kim, "Kirill A. Shutemov", Peter Zijlstra, Vegard Nossum,
 "Levin, Alexander", Mike Rapoport, Hillf Danton, Shaohua Li,
 Anshuman Khandual, Andrea Arcangeli, David Rientjes, Rik van Riel,
 Jan Kara, Dave Jiang, Jérôme Glisse, Matthew Wilcox, Ross Zwisler,
 Hugh Dickins, Tobin C Harding, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org

On Thu, Jan 25, 2018 at 09:13:03PM +0000, Mel Gorman wrote:
> On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
> > >> It's not really about memory scarcity but a more efficient use of it.
> > >> Applications may want hugepage benefits without requiring any changes to
> > >> app code which is what THP is supposed to provide, while still avoiding
> > >> memory bloat.
> > >>
> > > I read these links and find that there are mainly two complaints:
> > > 1. THP causes latency spikes, because direct compaction slows down
> > >    THP allocation,
> > > 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to
> > >    return memory ranges smaller than THP size and fails because of THP.
> > >
> > > The first complaint is not related to this patch.
> >
> > I'm trying to address many different THP issues and memory bloat is
> > first among them.
>
> Expecting userspace to get this right is probably going to go sideways.
> It'll be screwed up and be sub-optimal or have odd semantics for existing
> madvise flags. The fact is that an application may not even know if it's
> going to be sparsely using memory in advance if it's a computation load
> modelling from unknown input data.
>
> I suggest you read the old Talluri paper "Surpassing the TLB Performance
> of Superpages with Less Operating System Support" and pay attention to
> Section 4. There it discusses a page reservation scheme whereby on fault
> a naturally aligned set of base pages is reserved and only one correctly
> placed base page is inserted into the faulting address. It was tied into
> a hypothetical piece of hardware that doesn't exist to give best-effort
> support for superpages so it does not directly help you but the initial
> idea is sound. There are holes in the paper from today's perspective but
> it was written in the 90's.
>
> From there, read "Transparent operating system support for superpages"
> by Navarro, particularly chapter 4 paying attention to the parts where
> it talks about opportunism and promotion threshold.
>
> Superficially, it goes like this
>
> 1. On fault, reserve a THP in the allocator and use one base page that
>    is correctly-aligned for the faulting addresses.
>    By correctly-aligned, I mean that you use a base page whose offset
>    would be naturally contiguous if it ever was part of a huge page.
> 2. On subsequent faults, attempt to use a base page that is naturally
>    aligned to be a THP
> 3. When a "threshold" of base pages are inserted, allocate the remaining
>    pages and promote it to a THP
> 4. If there is memory pressure, spill "reserved" pages into the main
>    allocation pool and lose the opportunity to promote (which will need
>    khugepaged to recover)
>
> By definition, a promotion threshold of 1 would be the existing scheme
> of allocating a THP on the first fault and some users will want that. It
> also should be the default to avoid unexpected overhead. For workloads
> where memory is being sparsely addressed and the increased overhead of
> THP is unwelcome then the threshold should be tuned higher with a maximum
> possible value of HPAGE_PMD_NR.
>
> It's non-trivial to do this because at minimum a page fault has to check
> if there is a potential promotion candidate by checking the PTEs around
> the faulting address searching for a correctly-aligned base page that is
> already inserted. If there is, then check if the correctly aligned base
> page for the current faulting address is free and if so use it. It'll
> also then need to check the remaining PTEs to see if the promotion
> threshold has been reached and if so, promote it to a THP (or else teach
> khugepaged to do an in-place promotion if possible). In other words,
> implementing the promotion threshold is both hard and it's not free.

"Not free" is an understatement. Converting a PTE page table to a PMD
would require down_write(mmap_sem). Doing it from within the page fault
path would also mean that we need to drop the down_read(mmap_sem) we
hold, re-acquire it with down_write(), find the vma again and re-validate
that nothing changed in the meantime...

That's an interesting exercise, but I'm skeptical it would result in
anything practical.

--
 Kirill A. Shutemov
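
For readers following the thread, the reservation and promotion-threshold
scheme quoted above (steps 1-4) can be pictured with a small user-space
model. This is only a sketch under stated assumptions, not kernel code:
the Region class, its method names and its return strings are hypothetical,
and HPAGE_PMD_NR is fixed at 512 (x86-64 with 4 KiB base pages and 2 MiB
huge pages) purely for illustration.

```python
# Toy model of one naturally aligned, huge-page-sized virtual range.
# Assumption: 512 base pages per PMD-sized huge page (x86-64, 4 KiB pages).
HPAGE_PMD_NR = 512


class Region:
    """Hypothetical model of the reserve-then-promote idea; not a kernel API."""

    def __init__(self, threshold):
        if not 1 <= threshold <= HPAGE_PMD_NR:
            raise ValueError("threshold must be in 1..HPAGE_PMD_NR")
        self.threshold = threshold
        self.reserved = False  # a huge-page-sized block is held back for us
        self.spilled = False   # reservation was given up under pressure
        self.promoted = False  # the whole range is mapped as one huge page
        self.mapped = set()    # offsets of base pages inserted so far

    def fault(self, offset):
        """Handle a fault at base-page index `offset` and report the outcome."""
        if not 0 <= offset < HPAGE_PMD_NR:
            raise ValueError("offset outside the region")
        if self.promoted or offset in self.mapped:
            return "already-mapped"
        if self.spilled:
            # Reservation is gone: fall back to an unrelated base page.
            self.mapped.add(offset)
            return "base-page"
        # Steps 1 and 2: reserve on first fault, then use the base page whose
        # position inside the reserved block matches the faulting offset.
        self.reserved = True
        self.mapped.add(offset)
        if len(self.mapped) >= self.threshold:
            # Step 3: fill in the remaining pages and promote.
            self.mapped = set(range(HPAGE_PMD_NR))
            self.promoted = True
            return "promoted"
        return "reserved-page"

    def memory_pressure(self):
        """Step 4: spill unused reserved pages and lose the chance to promote."""
        if self.reserved and not self.promoted:
            self.reserved = False
            self.spilled = True

    def resident_pages(self):
        return len(self.mapped)
```

With threshold=1 the first fault promotes immediately, which corresponds to
the existing behaviour described in the quoted text; with a higher threshold
a sparsely touched region keeps only the pages it actually faulted in. The
model leaves out locking entirely: the "promoted" branch is where a PTE page
table would have to be converted to a PMD mapping, which is the step that
needs mmap_sem held for write.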