linux-api.vger.kernel.org archive mirror
From: Minchan Kim <minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
To: Shaohua Li <shli-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Andrew Morton
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	Michael Kerrisk
	<mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>,
	Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>,
	KOSAKI Motohiro
	<kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>,
	Jason Evans <je-b10kYP2dOMg@public.gmane.org>,
	Daniel Micay
	<danielmicay-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	"Kirill A. Shutemov"
	<kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>,
	Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>,
	yalin.wang2010-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	bmaurer-b10kYP2dOMg@public.gmane.org,
	John Stultz <john.stultz-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
Date: Thu, 5 Nov 2015 10:37:41 +0900
Message-ID: <20151105013741.GI7357@bbox>
In-Reply-To: <20151105013350.GH7357@bbox>

On Thu, Nov 05, 2015 at 10:33:50AM +0900, Minchan Kim wrote:
> On Wed, Nov 04, 2015 at 12:00:06PM -0800, Shaohua Li wrote:
> > On Wed, Nov 04, 2015 at 10:25:55AM +0900, Minchan Kim wrote:
> > > Linux doesn't have the ability to free pages lazily, while other OSes
> > > already support it under the name madvise(MADV_FREE).
> > > 
> > > The gain is clear: the kernel can discard freed pages rather than swapping
> > > them out or going OOM when memory pressure happens.
> > > 
> > > Without memory pressure, freed pages are reused by userspace without
> > > additional overhead (e.g., page fault + allocation + zeroing).
> > > 
> > > Jason Evans said:
> > > 
> > > : Facebook has been using MAP_UNINITIALIZED
> > > : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
> > > : several years, but there are operational costs to maintaining this
> > > : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
> > > : in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
> > > : increased throughput for much of our workload by ~5%, and although the
> > > : benefit has decreased using newer hardware and kernels, there is still
> > > : enough benefit that we cannot reasonably retire it without a replacement.
> > > :
> > > : Aside from Facebook operations, there are numerous broadly used
> > > : applications that would benefit from MADV_FREE.  The ones that immediately
> > > : come to mind are redis, varnish, and MariaDB.  I don't have much insight
> > > : into Android internals and development process, but I would hope to see
> > > : MADV_FREE support eventually end up there as well to benefit applications
> > > : linked with the integrated jemalloc.
> > > :
> > > : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
> > > : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
> > > : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
> > > : (and AIX, but I'm not sure it even compiles on AIX).  The lack of
> > > : MADV_FREE on Linux forced me down a long series of increasingly
> > > : sophisticated heuristics for madvise() volume reduction, and even so this
> > > : remains a common performance issue for people using jemalloc on Linux.
> > > : Please integrate MADV_FREE; many people will benefit substantially.
> > > 
> > > How it works:
> > > 
> > > When the madvise syscall is called, the VM clears the dirty bit of the ptes
> > > in the range.  If memory pressure happens, the VM checks the dirty bit in
> > > the page table; if it is still "clean", the page is a "lazyfree" page, so
> > > the VM can discard it instead of swapping it out.  If there was a store to
> > > the page before the VM picks it for reclaim, the dirty bit is set, so the
> > > VM swaps out the page instead of discarding it.
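
For readers who have not used the call, here is a minimal userspace sketch of
the semantics described above. It is an illustration under stated assumptions,
not code from the patch set: the helper names lazy_free() and lazy_free_demo()
are made up, the fallback value 8 for MADV_FREE is assumed for headers that
predate the flag, and the fallback to MADV_DONTNEED on EINVAL is just one
possible policy for kernels without MADV_FREE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* assumption: Linux value, for headers without the flag */
#endif

/* Hypothetical helper: mark [buf, buf + len) as lazily freeable.
 * Falls back to MADV_DONTNEED if the kernel rejects MADV_FREE with EINVAL.
 * Returns 0 on success, -1 on failure with errno set. */
int lazy_free(void *buf, size_t len)
{
    if (madvise(buf, len, MADV_FREE) == 0)
        return 0;
    if (errno == EINVAL)
        return madvise(buf, len, MADV_DONTNEED);
    return -1;
}

/* Hypothetical demo: dirty an anonymous mapping, lazily free it, reuse it.
 * Returns 0 if every observation matches the semantics described above. */
int lazy_free_demo(size_t npages)
{
    long psz = sysconf(_SC_PAGESIZE);
    if (psz <= 0)
        return -1;

    size_t len = npages * (size_t)psz;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len); /* dirty every page */
    if (lazy_free(p, len) != 0) {
        munmap(p, len);
        return -1;
    }

    int ok = 1;
    for (size_t i = 0; i < npages; i++) {
        unsigned char *page = p + i * (size_t)psz;

        /* Before a new store, a page holds either its old contents (not
         * reclaimed yet) or zeros (discarded, or the fallback was taken). */
        unsigned char seen = page[0];
        if (seen != 0xAB && seen != 0x00)
            ok = 0;

        /* A store sets the dirty bit again, so the new data must stick. */
        page[0] = 0xCD;
        if (page[0] != 0xCD)
            ok = 0;
    }

    munmap(p, len);
    return ok ? 0 : -1;
}
```

The point for an allocator is that no call is needed before reusing the range:
the first store after madvise() is what tells the VM the page is live again.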
> > > 
> > > The first heavy users would be general allocators (e.g., jemalloc and
> > > tcmalloc, and hopefully glibc will support it too); jemalloc and tcmalloc
> > > already support the feature on other OSes (e.g., FreeBSD).
> > > 
> > > barrios@blaptop:~/benchmark/ebizzy$ lscpu
> > > Architecture:          x86_64
> > > CPU op-mode(s):        32-bit, 64-bit
> > > Byte Order:            Little Endian
> > > CPU(s):                12
> > > On-line CPU(s) list:   0-11
> > > Thread(s) per core:    1
> > > Core(s) per socket:    1
> > > Socket(s):             12
> > > NUMA node(s):          1
> > > Vendor ID:             GenuineIntel
> > > CPU family:            6
> > > Model:                 2
> > > Stepping:              3
> > > CPU MHz:               3200.185
> > > BogoMIPS:              6400.53
> > > Virtualization:        VT-x
> > > Hypervisor vendor:     KVM
> > > Virtualization type:   full
> > > L1d cache:             32K
> > > L1i cache:             32K
> > > L2 cache:              4096K
> > > NUMA node0 CPU(s):     0-11
> > > 
> > > ebizzy benchmark (./ebizzy -S 10 -n 512)
> > > 
> > > Higher avg is better.
> > > 
> > >  vanilla-jemalloc		MADV_free-jemalloc
> > > 
> > > 1 thread
> > > records: 10			    records: 10
> > > avg:	2961.90			    avg:   12069.70
> > > std:	  71.96(2.43%)		    std:     186.68(1.55%)
> > > max:	3070.00			    max:   12385.00
> > > min:	2796.00			    min:   11746.00
> > > 
> > > 2 thread
> > > records: 10			    records: 10
> > > avg:	5020.00			    avg:   17827.00
> > > std:	 264.87(5.28%)		    std:     358.52(2.01%)
> > > max:	5244.00			    max:   18760.00
> > > min:	4251.00			    min:   17382.00
> > > 
> > > 4 thread
> > > records: 10			    records: 10
> > > avg:	8988.80			    avg:   27930.80
> > > std:	1175.33(13.08%)		    std:    3317.33(11.88%)
> > > max:	9508.00			    max:   30879.00
> > > min:	5477.00			    min:   21024.00
> > > 
> > > 8 thread
> > > records: 10			    records: 10
> > > avg:   13036.50			    avg:   33739.40
> > > std:	 170.67(1.31%)		    std:    5146.22(15.25%)
> > > max:   13371.00			    max:   40572.00
> > > min:   12785.00			    min:   24088.00
> > > 
> > > 16 thread
> > > records: 10			    records: 10
> > > avg:   11092.40			    avg:   31424.20
> > > std:	 710.60(6.41%)		    std:    3763.89(11.98%)
> > > max:   12446.00			    max:   36635.00
> > > min:	9949.00			    min:   25669.00
> > > 
> > > 32 thread
> > > records: 10			    records: 10
> > > avg:   11067.00			    avg:   34495.80
> > > std:	 971.06(8.77%)		    std:    2721.36(7.89%)
> > > max:   12010.00			    max:   38598.00
> > > min:	9002.00			    min:   30636.00
> > > 
> > > In summary, MADV_FREE is much faster than MADV_DONTNEED: the average is
> > > roughly 2.6x to 4.1x higher in the runs above.
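
The per-thread-count ratios behind that summary can be recomputed from the
"avg" rows of the table. The sketch below only restates the posted numbers;
the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One row per thread count, copied from the "avg" lines of the table. */
struct ebizzy_row {
    int threads;
    double vanilla_avg;   /* vanilla-jemalloc (MADV_DONTNEED) */
    double madv_free_avg; /* MADV_free-jemalloc */
};

const struct ebizzy_row ebizzy_rows[] = {
    {  1,  2961.90, 12069.70 },
    {  2,  5020.00, 17827.00 },
    {  4,  8988.80, 27930.80 },
    {  8, 13036.50, 33739.40 },
    { 16, 11092.40, 31424.20 },
    { 32, 11067.00, 34495.80 },
};
const size_t ebizzy_nrows = sizeof(ebizzy_rows) / sizeof(ebizzy_rows[0]);

/* Ratio of the MADV_FREE average to the vanilla average for row i. */
double ebizzy_speedup(size_t i)
{
    return ebizzy_rows[i].madv_free_avg / ebizzy_rows[i].vanilla_avg;
}

/* Print one line per thread count. */
void ebizzy_print_speedups(void)
{
    for (size_t i = 0; i < ebizzy_nrows; i++)
        printf("%2d thread(s): %.2fx\n",
               ebizzy_rows[i].threads, ebizzy_speedup(i));
}
```

The ratio is largest with 1 thread and smallest with 8 threads.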
> > 
> > MADV_FREE has been discussed for a while, so it is probably too late to propose
> > something new, but we recently had a new idea (from Ben Maurer, CCed) and
> > think it's better. Our target is still jemalloc.
> > 
> > Compared to MADV_DONTNEED, MADV_FREE's lazy memory free is a huge win for
> > reducing page faults. But one issue remains: the TLB flush. Both MADV_DONTNEED
> > and MADV_FREE do a TLB flush, and TLB flush overhead is quite big in contemporary
> > multi-threaded applications. In our production workload, we sometimes observed
> > 80% of CPU time spent on TLB flushes triggered by jemalloc's
> > madvise(MADV_DONTNEED). We haven't tested MADV_FREE yet, but the result should
> > be similar. It's hard to avoid the TLB flush with MADV_FREE, because the flush
> > helps avoid data corruption.
> > 
> > The new proposal tries to fix the TLB issue. We introduce two madvise verbs:
> > 
> > MARK_FREE. Userspace notifies the kernel that the memory range can be
> > discarded. At this stage the kernel just records the range. Should memory
> > pressure happen, page reclaim can free the memory directly, regardless of
> > the pte state.
> > 
> > MARK_NOFREE. Userspace notifies the kernel that the memory range will be
> > reused soon. The kernel deletes the record and prevents page reclaim from
> > discarding the memory. If the memory hasn't been reclaimed, userspace accesses
> > the old memory; otherwise normal page fault handling takes place.
> > 
> > The point is to let userspace notify the kernel when memory can be discarded,
> > instead of depending on the pte dirty bit used by MADV_FREE. With these, no TLB
> > flush is required until page reclaim actually frees the memory (page reclaim
> > needs to do the TLB flush for MADV_FREE too). It still preserves the lazy
> > memory free merit of MADV_FREE.
> > 
> > Compared to MADV_FREE, reusing memory with the new proposal isn't transparent,
> > i.e., userspace must call MARK_NOFREE. But it's easy to utilize the new API in
> > jemalloc.
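
To make the proposed protocol concrete, here is a toy state machine for a
single range. No kernel code exists for this proposal, so everything below is
hypothetical: the type and function names are invented, and toy_reclaim()
merely stands in for page reclaim under memory pressure. It models only the
bookkeeping, i.e. where the TLB flush would be paid and what MARK_NOFREE would
report back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states of one userspace memory range. */
enum toy_state { TOY_IN_USE, TOY_MARKED_FREE, TOY_RECLAIMED };

struct toy_range {
    enum toy_state state;
    unsigned tlb_flushes; /* flushes charged to this range so far */
};

/* MARK_FREE: only record that the range may be discarded.
 * No pte is touched, so no TLB flush is charged here. */
void toy_mark_free(struct toy_range *r)
{
    if (r->state == TOY_IN_USE)
        r->state = TOY_MARKED_FREE;
}

/* Stand-in for page reclaim: discard a marked range.
 * This is the only place where the model pays for a TLB flush. */
void toy_reclaim(struct toy_range *r)
{
    if (r->state == TOY_MARKED_FREE) {
        r->state = TOY_RECLAIMED;
        r->tlb_flushes++;
    }
}

/* MARK_NOFREE: delete the record so reclaim leaves the range alone.
 * Returns true if the old contents survived, false if the range was
 * already reclaimed and the next access takes a normal page fault. */
bool toy_mark_nofree(struct toy_range *r)
{
    bool kept = (r->state != TOY_RECLAIMED);
    r->state = TOY_IN_USE;
    return kept;
}
```

In this model a MARK_FREE followed by MARK_NOFREE with no reclaim in between
costs zero flushes, which is the case the proposal wants to make cheap.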
> > 
> > We don't have code to back this up yet, sorry. We'd like to discuss whether
> > it makes sense.
> 
> It's really what volatile ranges did.
> John Stultz and I tried it for a *long* time but it had lots of troubles.
> It's really hard to write it all down now because of the really long history,
> and I have even forgotten lots of the details (i.e., dead brain).
> Please search for "volatile ranges" on Google.
> Finally, people at LSF/MM suggested MADV_FREE to help the anonymous page side
> rather than staying stuck, which prevents a useful feature. :(

I should have CCed John Stultz.

He probably remembers this better than I do, so he could help, but I'm not
sure he is still interested in volatile ranges.
