From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f45.google.com (mail-wm0-f45.google.com [74.125.82.45])
    by kanga.kvack.org (Postfix) with ESMTP id 32DD66B0005
    for ; Tue, 16 Feb 2016 19:08:06 -0500 (EST)
Received: by mail-wm0-f45.google.com with SMTP id b205so133100318wmb.1
    for ; Tue, 16 Feb 2016 16:08:06 -0800 (PST)
Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12])
    by mx.google.com with ESMTPS id g67si674473wmi.14.2016.02.16.16.08.04
    for
    (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
    Tue, 16 Feb 2016 16:08:05 -0800 (PST)
Date: Tue, 16 Feb 2016 16:08:02 -0800
From: Andrew Morton
Subject: Re: [PATCH V4][for-next]mm: add a new vector based madvise syscall
Message-Id: <20160216160802.50ceaf10aa16588e18b3d2c5@linux-foundation.org>
In-Reply-To:
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Shaohua Li
Cc: linux-mm@kvack.org, linux-api@vger.kernel.org, Kernel-team@fb.com,
    Rik van Riel, Mel Gorman, Hugh Dickins, Johannes Weiner,
    Andrea Arcangeli, Andi Kleen, Minchan Kim, Arnd Bergmann

On Thu, 10 Dec 2015 16:03:37 -0800 Shaohua Li wrote:

> In jemalloc, a free(3) doesn't immediately free the memory to OS even
> the memory is page aligned/size, and hope the memory can be reused soon.
> Later the virtual address becomes fragmented, and more and more free
> memory are aggregated. If the free memory size is large, jemalloc uses
> madvise(DONT_NEED) to actually free the memory back to OS.
>
> The madvise has significantly overhead paritcularly because of TLB
> flush. jemalloc does madvise for several virtual address space ranges
> one time. Instead of calling madvise for each of the ranges, we
> introduce a new syscall to purge memory for several ranges one time. In
> this way, we can merge several TLB flush for the ranges to one big TLB
> flush.
> This also reduce mmap_sem locking and kernel/userspace switching.
>
> I'm running a simple memory allocation benchmark. 32 threads do random
> malloc/free/realloc.

CPU count?  (Does that matter much?)

> Corresponding jemalloc patch to utilize this API is
> attached.

No it isn't ;)

Who maintains jemalloc?  Are they signed up to actually apply the
patch?  It would be bad to add the patch to the kernel and then find
that the jemalloc maintainers choose not to use it!

> Without patch:
> real 0m18.923s
> user 1m11.819s
> sys 7m44.626s
> each cpu gets around 3000K/s TLB flush interrupt. Perf shows TLB flush
> is hotest functions. mmap_sem read locking (because of page fault) is
> also heavy.
>
> with patch:
> real 0m15.026s
> user 0m48.548s
> sys 6m41.153s
> each cpu gets around 140k/s TLB flush interrupt. TLB flush isn't hot at
> all. mmap_sem read locking (still because of page fault) becomes the
> sole hot spot.

This is a somewhat underwhelming improvement, given that it's a
synthetic microbenchmark.

> Another test malloc a bunch of memory in 48 threads, then all threads
> free the memory. I measure the time of the memory free.
> Without patch: 34.332s
> With patch: 17.429s

This is more whelming.

Do we have a feel for how much benefit this patch will have for
real-world workloads?  That's pretty important.

> MADV_FREE does the same TLB flush as MADV_NEED, this also applies to

I'll do s/MADV_NEED/MADV_DONTNEED/

> MADV_FREE. Other madvise type can have small benefits too, like reduce
> syscalls/mmap_sem locking.

Could we please get a testcase for the syscall(s) into
tools/testing/selftests/vm?  For long-term maintenance reasons and as a
service to arch maintainers - make it easy for them to check the
functionality without having to roll their own (possibly incomplete)
test app.

I'm not sure *how* we'd develop a test case.  Use mincore()?
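As a rough illustration of the mincore() idea, here is a minimal sketch of what such a check might look like.  It is an assumption-laden example, not the selftest itself: it purges ranges with the existing madvise(2), one call per range, because the vectored syscall's interface is not reproduced in this message, and the names check_purge(), page_resident() and NPAGES are hypothetical.  A real test would presumably replace the per-range loop with a single call to the new syscall and keep the same verification.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 16	/* hypothetical test size */

/* Return 1 if the page is resident, 0 if not, -1 if mincore() fails. */
static int page_resident(void *page, size_t psz)
{
	unsigned char vec;

	if (mincore(page, psz, &vec) != 0)
		return -1;
	return vec & 1;		/* only bit 0 is defined */
}

/*
 * Dirty NPAGES anonymous pages, purge the odd-numbered ones, then verify.
 * Returns 0 on success, -1 on setup failure, 1 if a purged page is still
 * reported resident, 2 if page contents are not what we expect.
 */
int check_purge(int check_residency)
{
	long sc = sysconf(_SC_PAGESIZE);
	size_t psz, i;
	unsigned char *buf;
	int ret = 0;

	assert(sc > 0);
	psz = (size_t)sc;
	buf = mmap(NULL, NPAGES * psz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	memset(buf, 0xaa, NPAGES * psz);	/* fault every page in */

	/* One madvise() per range; a vectored call would batch these. */
	for (i = 1; i < NPAGES; i += 2) {
		if (madvise(buf + i * psz, psz, MADV_DONTNEED) != 0) {
			ret = -1;
			goto out;
		}
	}

	for (i = 0; i < NPAGES && ret == 0; i++) {
		unsigned char *p = buf + i * psz;
		int purged = i & 1;
		unsigned char want = purged ? 0x00 : 0xaa;

		/* Sample residency before reading: a read faults the page back in. */
		if (check_residency && purged && page_resident(p, psz) != 0)
			ret = 1;
		else if (p[0] != want || p[psz - 1] != want)
			ret = 2;
	}
out:
	munmap(buf, NPAGES * psz);
	return ret;
}
```

Calling check_purge(1) exercises both the mincore() residency check and the zero-fill check; check_purge(0) relies only on the documented behaviour that private anonymous pages read back as zeroes after MADV_DONTNEED, which may be the more portable of the two.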

> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -21,7 +21,10 @@
>  #include
>  #include
>  #include
> -
> +#include
> +#ifdef CONFIG_COMPAT
> +#include
> +#endif

I'll nuke the ifdefs - compat.h already does that.

It would be good for us to have a look at the manpage before going too
far with the patch - this helps reviewers to think about the proposed
interface and behaviour.

I'll queue this up for a bit of testing, although it won't get tested
much.  The syscall fuzzers will presumably hit on it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org