From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vlastimil Babka
Subject: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Tue, 03 Feb 2015 09:19:15 +0100
Message-ID: <54D08483.40209@suse.cz>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <54CFF8AC.6010102@intel.com>
Sender: owner-linux-mm@kvack.org
To: Dave Hansen, Mel Gorman, linux-mm@kvack.org
Cc: Minchan Kim, Andrew Morton, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org
List-Id: linux-api@vger.kernel.org

[CC linux-api, man pages]

On 02/02/2015 11:22 PM, Dave Hansen wrote:
> On 02/02/2015 08:55 AM, Mel Gorman wrote:
>> This patch identifies when a thread is frequently calling MADV_DONTNEED
>> on the same region of memory and starts ignoring the hint. On an 8-core
>> single-socket machine this was the impact on ebizzy using glibc 2.19.
>
> The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
> called:
>
>> MADV_DONTNEED
>> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources
>> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the
>> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
>
> So if we have anything depending on the behavior that it's _always_
> zero-filled after an MADV_DONTNEED, this will break it.

OK, so that's a third person (including me) who understood it as a zero-fill
guarantee. I think the man page should be clarified (if it's indeed not
guaranteed), or we have a bug.

The implementation actually skips MADV_DONTNEED for
VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.
I'm not sure about VM_PFNMAP, these are probably special enough. For mlock, one
could expect that mlocking and MADV_DONTNEED would be in some opposition, but
it's not documented in the manpage AFAIK. Neither is the hugetlb case, which
could be really unexpected by the user.

Next, what the man page says about guarantees:

"The kernel is free to ignore the advice."

- that would suggest that nothing is guaranteed

"This call does not influence the semantics of the application (except in the
case of MADV_DONTNEED)"

- that depends if the reader understands it as "does influence by MADV_DONTNEED"
or "may influence by MADV_DONTNEED"

- btw, isn't MADV_DONTFORK another exception that does influence the semantics?
And since it's mentioned as a workaround for some hardware, is it OK to ignore
this advice?

And the part you already cited:

"Subsequent accesses of pages in this range will succeed, but will result either
in reloading of the memory contents from the underlying mapped file (see
mmap(2)) or zero-fill on-demand pages for mappings without an underlying file."

- The word "will result" did sound as a guarantee at least to me. So here it
could be changed to "may result (unless the advice is ignored)"?

And if we agree that there is indeed no guarantee, what's the actual semantic
difference from MADV_FREE? I guess none? So there's only a possible performance
difference?

Vlastimil

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Kirill A. Shutemov"
Subject: Re: MADV_DONTNEED semantics?
 Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Tue, 3 Feb 2015 12:53:01 +0200
Message-ID: <20150203105301.GC14259@node.dhcp.inet.fi>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <54D08483.40209-AlSwsSmVLrQ@public.gmane.org>
Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Vlastimil Babka
Cc: Dave Hansen, Mel Gorman, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Minchan Kim, Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: linux-api@vger.kernel.org

On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
> [CC linux-api, man pages]
>
> On 02/02/2015 11:22 PM, Dave Hansen wrote:
> > On 02/02/2015 08:55 AM, Mel Gorman wrote:
> >> This patch identifies when a thread is frequently calling MADV_DONTNEED
> >> on the same region of memory and starts ignoring the hint. On an 8-core
> >> single-socket machine this was the impact on ebizzy using glibc 2.19.
> >
> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
> > called:
> >
> >> MADV_DONTNEED
> >> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources
> >> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the
> >> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
> >
> > So if we have anything depending on the behavior that it's _always_
> > zero-filled after an MADV_DONTNEED, this will break it.
>
> OK, so that's a third person (including me) who understood it as a zero-fill
> guarantee. I think the man page should be clarified (if it's indeed not
> guaranteed), or we have a bug.
>
> The implementation actually skips MADV_DONTNEED for
> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.

It doesn't skip. It fails with -EINVAL. Or I miss something.

> - The word "will result" did sound as a guarantee at least to me. So here it
> could be changed to "may result (unless the advice is ignored)"?

It's too late to fix documentation. Applications already depend on the
behaviour.

--
Kirill A. Shutemov

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mel Gorman
Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Tue, 3 Feb 2015 11:16:00 +0000
Message-ID: <20150203111600.GR2395@suse.de>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Return-path:
Content-Disposition: inline
In-Reply-To: <54D08483.40209@suse.cz>
Sender: owner-linux-mm@kvack.org
To: Vlastimil Babka
Cc: Dave Hansen, linux-mm@kvack.org, Minchan Kim, Andrew Morton, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org
List-Id: linux-api@vger.kernel.org

On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
> [CC linux-api, man pages]
>
> On 02/02/2015 11:22 PM, Dave Hansen wrote:
> > On 02/02/2015 08:55 AM, Mel Gorman wrote:
> >> This patch identifies when a thread is frequently calling MADV_DONTNEED
> >> on the same region of memory and starts ignoring the hint. On an 8-core
> >> single-socket machine this was the impact on ebizzy using glibc 2.19.
> >
> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
> > called:
> >
> >> MADV_DONTNEED
> >> Do not expect access in the near future.
> >> (For the time being, the application is finished with the given range, so the kernel can free resources
> >> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the
> >> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
> >
> > So if we have anything depending on the behavior that it's _always_
> > zero-filled after an MADV_DONTNEED, this will break it.
>
> OK, so that's a third person (including me) who understood it as a zero-fill
> guarantee. I think the man page should be clarified (if it's indeed not
> guaranteed), or we have a bug.
>
> The implementation actually skips MADV_DONTNEED for
> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.
>

This was the first reason why I did not consider the zero-filling to be a
guarantee. That said, at this point I'm also not considering pushing this
patch towards the kernel. I agree that this is a glibc bug so I've dropped
a line to some glibc people to see what they think the approach should be.

> I'm not sure about VM_PFNMAP, these are probably special enough. For mlock, one
> could expect that mlocking and MADV_DONTNEED would be in some opposition, but
> it's not documented in the manpage AFAIK. Neither is the hugetlb case, which
> could be really unexpected by the user.
>

The equivalent POSIX page also lacks details on how exactly this flag
should behave. hugetlb is sort of special in that it's always backed by a
ram-based file where the contents can be refaulted. It gets hairy when the
mapping has been created to look anonymous but is not anonymous really. The
semantics of hugetlb have always been fuzzy.

> Next, what the man page says about guarantees:
>
> "The kernel is free to ignore the advice."
>
> - that would suggest that nothing is guaranteed
>

Yep, another reason why I did not clear the page when ignoring the hint.
> "This call does not influence the semantics of the application (except in the
> case of MADV_DONTNEED)"
>
> - that depends if the reader understands it as "does influence by MADV_DONTNEED"
> or "may influence by MADV_DONTNEED"
>
> - btw, isn't MADV_DONTFORK another exception that does influence the semantics?
> And since it's mentioned as a workaround for some hardware, is it OK to ignore
> this advice?
>

MADV_DONTFORK is also a Linux-specific extension. It happens to be one that
if it gets ignored then the application will be very surprised.

> And the part you already cited:
>
> "Subsequent accesses of pages in this range will succeed, but will result either
> in reloading of the memory contents from the underlying mapped file (see
> mmap(2)) or zero-fill on-demand pages for mappings without an underlying file."
>
> - The word "will result" did sound as a guarantee at least to me. So here it
> could be changed to "may result (unless the advice is ignored)"?
>

The wording should be "may result" as there are circumstances where it
gets ignored even without this prototype patch.

> And if we agree that there is indeed no guarantee, what's the actual semantic
> difference from MADV_FREE? I guess none? So there's only a possible performance
> difference?
>

Timing. MADV_DONTNEED if it has an effect is immediate, is a heavier
operation and RSS is reduced. MADV_FREE only has an impact in the future
if there is memory pressure.

--
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vlastimil Babka
Subject: Re: MADV_DONTNEED semantics?
 Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Tue, 03 Feb 2015 12:42:53 +0100
Message-ID: <54D0B43D.8000209@suse.cz>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20150203105301.GC14259@node.dhcp.inet.fi>
Sender: owner-linux-mm@kvack.org
To: "Kirill A. Shutemov"
Cc: Dave Hansen, Mel Gorman, linux-mm@kvack.org, Minchan Kim, Andrew Morton, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org
List-Id: linux-api@vger.kernel.org

On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote:
> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
>> [CC linux-api, man pages]
>>
>> On 02/02/2015 11:22 PM, Dave Hansen wrote:
>> > On 02/02/2015 08:55 AM, Mel Gorman wrote:
>> >> This patch identifies when a thread is frequently calling MADV_DONTNEED
>> >> on the same region of memory and starts ignoring the hint. On an 8-core
>> >> single-socket machine this was the impact on ebizzy using glibc 2.19.
>> >
>> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
>> > called:
>> >
>> >> MADV_DONTNEED
>> >> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources
>> >> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the
>> >> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
>> >
>> > So if we have anything depending on the behavior that it's _always_
>> > zero-filled after an MADV_DONTNEED, this will break it.
>>
>> OK, so that's a third person (including me) who understood it as a zero-fill
>> guarantee.
>> I think the man page should be clarified (if it's indeed not
>> guaranteed), or we have a bug.
>>
>> The implementation actually skips MADV_DONTNEED for
>> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.
>
> It doesn't skip. It fails with -EINVAL. Or I miss something.

No, I missed that. Thanks for pointing that out. The manpage also explains
EINVAL in this case:

* The application is attempting to release locked or shared pages (with
MADV_DONTNEED).

- that covers mlocking ok, not sure if the rest fits the "shared pages" case
though. I don't see any check for other kinds of shared pages in the code.

>> - The word "will result" did sound as a guarantee at least to me. So here it
>> could be changed to "may result (unless the advice is ignored)"?
>
> It's too late to fix documentation. Applications already depend on the
> behaviour.

Right, so as long as they check for EINVAL, it should be safe. It appears that
jemalloc does.

I still wouldn't be sure just by reading the man page that the clearing is
guaranteed whenever I don't get an error return value, though.

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: MADV_DONTNEED semantics?
 Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Tue, 3 Feb 2015 16:21:21 +0100
Message-ID: <20150203152121.GC8914@dhcp22.suse.cz>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203111600.GR2395@suse.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <20150203111600.GR2395-l3A5Bk7waGM@public.gmane.org>
Sender: linux-man-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Mel Gorman
Cc: Vlastimil Babka, Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Minchan Kim, Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: linux-api@vger.kernel.org

On Tue 03-02-15 11:16:00, Mel Gorman wrote:
> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
[...]
> > And if we agree that there is indeed no guarantee, what's the actual semantic
> > difference from MADV_FREE? I guess none? So there's only a possible performance
> > difference?
> >
>
> Timing. MADV_DONTNEED if it has an effect is immediate, is a heavier
> operation and RSS is reduced. MADV_FREE only has an impact in the future
> if there is memory pressure.

JFTR. the man page for MADV_FREE has been proposed already
(https://lkml.org/lkml/2014/12/5/63 should be the last version AFAIR). I
do not see it in the man-pages git tree but the patch was not in time
for 3.19 so I guess it will only appear in 3.20.

--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Michael Kerrisk (man-pages)"
Subject: Re: MADV_DONTNEED semantics?
 Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Tue, 03 Feb 2015 17:20:58 +0100
Message-ID: <54D0F56A.9050003@gmail.com>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <54D0B43D.8000209@suse.cz>
Sender: owner-linux-mm@kvack.org
To: Vlastimil Babka, "Kirill A. Shutemov"
Cc: mtk.manpages@gmail.com, Dave Hansen, Mel Gorman, linux-mm@kvack.org, Minchan Kim, Andrew Morton, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, linux-man@vger.kernel.org, Hugh Dickins
List-Id: linux-api@vger.kernel.org

Hello Vlastimil

Thanks for CCing me into this thread.

On 02/03/2015 12:42 PM, Vlastimil Babka wrote:
> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote:
>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
>>> [CC linux-api, man pages]
>>>
>>> On 02/02/2015 11:22 PM, Dave Hansen wrote:
>>>> On 02/02/2015 08:55 AM, Mel Gorman wrote:
>>>>> This patch identifies when a thread is frequently calling MADV_DONTNEED
>>>>> on the same region of memory and starts ignoring the hint. On an 8-core
>>>>> single-socket machine this was the impact on ebizzy using glibc 2.19.
>>>>
>>>> The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
>>>> called:
>>>>
>>>>> MADV_DONTNEED
>>>>> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources
>>>>> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the
>>>>> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
>>>>
>>>> So if we have anything depending on the behavior that it's _always_
>>>> zero-filled after an MADV_DONTNEED, this will break it.
>>>
>>> OK, so that's a third person (including me) who understood it as a zero-fill
>>> guarantee. I think the man page should be clarified (if it's indeed not
>>> guaranteed), or we have a bug.
>>>
>>> The implementation actually skips MADV_DONTNEED for
>>> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.
>>
>> It doesn't skip. It fails with -EINVAL. Or I miss something.
>
> No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in
> this case:
>
> * The application is attempting to release locked or shared pages (with
> MADV_DONTNEED).

Yes, there is that. But the page could be more explicit when discussing
MADV_DONTNEED in the main text. I've done that.

> - that covers mlocking ok, not sure if the rest fits the "shared pages" case
> though. I don't see any check for other kinds of shared pages in the code.

Agreed. "shared" here seems confused. I've removed it. And I've
added mention of "Huge TLB pages" for this error.

>>> - The word "will result" did sound as a guarantee at least to me. So here it
>>> could be changed to "may result (unless the advice is ignored)"?
>>
>> It's too late to fix documentation. Applications already depend on the
>> behaviour.
>
> Right, so as long as they check for EINVAL, it should be safe. It appears that
> jemalloc does.

So, first a brief question: in the cases where the call does not error out,
are we agreed that in the current implementation, MADV_DONTNEED will
always result in zero-filled pages when the region is faulted back in
(when we consider pages that are not backed by a file)?

> I still wouldn't be sure just by reading the man page that the clearing is
> guaranteed whenever I don't get an error return value, though,

I'm not quite sure what you want here. I mean: if there's an error,
then the DONTNEED action didn't occur, right? Therefore, there won't
be zero-filled pages.
But, for what it's worth, I added "If the
operation succeeds" at the start of that sentence beginning "Subsequent
accesses...".

Now, some history, explaining why the page is a bit of a mess,
and for that matter why I could really use more help on it from MM
folk (especially in the form of actual patches [1], rather than notes
about deficiencies in the documentation), because:

    ***I simply cannot keep up with all of the details***.

Once upon a time (Linux 2.4), there was madvise() with just 5 flags:

    MADV_NORMAL
    MADV_RANDOM
    MADV_SEQUENTIAL
    MADV_WILLNEED
    MADV_DONTNEED

And already a dozen years ago, *I* added the text about MADV_DONTNEED.
Back then, I believe it was true. I'm not sure if it's still true now,
but I assume for the moment that it is, and await feedback. And the
text saying that the call does not affect the semantics of memory
access dates back even further (and was then true, MADV_DONTNEED aside).

Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there
is a semantic mismatch between the destructive MADV_DONTNEED and
POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear
on most other implementations.

Since the original implementation, numerous pieces of cruft^W^W^W
excellent new flags have been overloaded into this one system call.
Some of those certainly violated the "does not change the semantics
of the application" statement, but, sadly, the kernel developers who
implemented MADV_REMOVE or MADV_DONTFORK did not think to send a
patch to the man page for those new flags, one that might have noted
that the semantics of the application are changed by such flags. Equally
sadly, I did overlook to scan the bigger page when *I* added
documentation of these flags to those pages, otherwise I might have
caught that detail.

So, just to repeat, I could really use more help on it from MM
folk in the form of actual patches to the man page.
Thanks,

Michael

[1] https://www.kernel.org/doc/man-pages/patches.html

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Michael Kerrisk (man-pages)"
Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Tue, 03 Feb 2015 17:25:08 +0100
Message-ID: <54D0F664.3070606@gmail.com>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203111600.GR2395@suse.de> <20150203152121.GC8914@dhcp22.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20150203152121.GC8914-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Michal Hocko, Mel Gorman
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, minchan Kim, Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Minchan Kim, Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: linux-api@vger.kernel.org

On 02/03/2015 04:21 PM, Michal Hocko wrote:
> On Tue 03-02-15 11:16:00, Mel Gorman wrote:
>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
> [...]
>>> And if we agree that there is indeed no guarantee, what's the actual semantic
>>> difference from MADV_FREE? I guess none? So there's only a possible performance
>>> difference?
>>>
>>
>> Timing. MADV_DONTNEED if it has an effect is immediate, is a heavier
>> operation and RSS is reduced.
>> MADV_FREE only has an impact in the future
>> if there is memory pressure.
>
> JFTR. the man page for MADV_FREE has been proposed already
> (https://lkml.org/lkml/2014/12/5/63 should be the last version AFAIR). I
> do not see it in the man-pages git tree but the patch was not in time
> for 3.19 so I guess it will only appear in 3.20.
>

Yikes! That patch was buried in the bottom of a locked filing cabinet
in a disused lavatory. I unfortunately don't read every thread that
comes my way, especially if it doesn't look like a man-pages patch
(i.e., falls in the middle of an LKML thread that starts on another
topic, and doesn't see linux-man@).

I'll respond to that patch soon. (There are some problems that mean
I could not accept it, AFAICT.)

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Minchan Kim
Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Wed, 4 Feb 2015 09:09:21 +0900
Message-ID: <20150204000921.GC3583@blaptop>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <54D0B43D.8000209-AlSwsSmVLrQ@public.gmane.org>
Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Vlastimil Babka
Cc: "Kirill A.
Shutemov", Dave Hansen, Mel Gorman, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Rik van Riel
List-Id: linux-api@vger.kernel.org

On Tue, Feb 03, 2015 at 12:42:53PM +0100, Vlastimil Babka wrote:
> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote:
> > On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
> >> [CC linux-api, man pages]
> >>
> >> On 02/02/2015 11:22 PM, Dave Hansen wrote:
> >> > On 02/02/2015 08:55 AM, Mel Gorman wrote:
> >> >> This patch identifies when a thread is frequently calling MADV_DONTNEED
> >> >> on the same region of memory and starts ignoring the hint. On an 8-core
> >> >> single-socket machine this was the impact on ebizzy using glibc 2.19.
> >> >
> >> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
> >> > called:
> >> >
> >> >> MADV_DONTNEED
> >> >> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources
> >> >> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the
> >> >> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
> >> >
> >> > So if we have anything depending on the behavior that it's _always_
> >> > zero-filled after an MADV_DONTNEED, this will break it.
> >>
> >> OK, so that's a third person (including me) who understood it as a zero-fill
> >> guarantee. I think the man page should be clarified (if it's indeed not
> >> guaranteed), or we have a bug.
> >>
> >> The implementation actually skips MADV_DONTNEED for
> >> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.
> >
> > It doesn't skip. It fails with -EINVAL. Or I miss something.
>
> No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in
> this case:
>
> * The application is attempting to release locked or shared pages (with
> MADV_DONTNEED).
>
> - that covers mlocking ok, not sure if the rest fits the "shared pages" case
> though. I don't see any check for other kinds of shared pages in the code.
>
> >> - The word "will result" did sound as a guarantee at least to me. So here it
> >> could be changed to "may result (unless the advice is ignored)"?
> >
> > It's too late to fix documentation. Applications already depend on the
> > behaviour.
>
> Right, so as long as they check for EINVAL, it should be safe. It appears that
> jemalloc does.
>
> I still wouldn't be sure just by reading the man page that the clearing is
> guaranteed whenever I don't get an error return value, though,
>

IMHO,

The man page says

"MADV_DONTNEED: Subsequent accesses of pages in this range will succeed, but
will result either in reloading of the memory contents from the underlying
mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without
an underlying file."

Heap memory allocated by malloc(3) is anonymous, so it's a mapping
without an underlying file, and userspace can expect zero-fill.

The man page says

"EINVAL: The application is attempting to release locked or shared pages
(with MADV_DONTNEED)"

So a user can expect that the call on an area allocated by malloc(3) will
always be successful, as long as he doesn't call mlock on it.

The man page says

"madvise: This call does not influence the semantics of the application
(except in the case of MADV_DONTNEED)"

So we shouldn't break MADV_DONTNEED's semantic, which frees pages
instantly. It's a long-standing semantic, and it was one of the arguable
issues with MADV_FREE when Rik tried, a long time ago, to replace
MADV_DONTNEED with MADV_FREE.

--
Kind regards,
Minchan Kim

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vlastimil Babka
Subject: Re: MADV_DONTNEED semantics?
 Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Wed, 04 Feb 2015 14:46:00 +0100
Message-ID: <54D22298.3040504@suse.cz>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <54D0F56A.9050003-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: "Michael Kerrisk (man-pages)", "Kirill A. Shutemov"
Cc: Dave Hansen, Mel Gorman, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Minchan Kim, Andrew Morton, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Hugh Dickins
List-Id: linux-api@vger.kernel.org

On 02/03/2015 05:20 PM, Michael Kerrisk (man-pages) wrote:
> Hello Vlastimil
>
> Thanks for CCing me into this thread.

NP

> On 02/03/2015 12:42 PM, Vlastimil Babka wrote:
>> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote:
>>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
>>>
>>> It doesn't skip. It fails with -EINVAL. Or I miss something.
>>
>> No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in
>> this case:
>>
>> * The application is attempting to release locked or shared pages (with
>> MADV_DONTNEED).
>
> Yes, there is that. But the page could be more explicit when discussing
> MADV_DONTNEED in the main text. I've done that.
>
>> - that covers mlocking ok, not sure if the rest fits the "shared pages" case
>> though. I don't see any check for other kinds of shared pages in the code.
>
> Agreed. "shared" here seems confused. I've removed it. And I've
> added mention of "Huge TLB pages" for this error.
>

Thanks.
>>>> - The word "will result" did sound as a guarantee at least to me. So here it
>>>> could be changed to "may result (unless the advice is ignored)"?
>>>
>>> It's too late to fix documentation. Applications already depend on the
>>> behaviour.
>>
>> Right, so as long as they check for EINVAL, it should be safe. It appears that
>> jemalloc does.
>
> So, first a brief question: in the cases where the call does not error out,
> are we agreed that in the current implementation, MADV_DONTNEED will
> always result in zero-filled pages when the region is faulted back in
> (when we consider pages that are not backed by a file)?

I'd agree at this point.
Also we should probably mention anonymously shared pages (shmem). I think they
behave the same as file here.

>> I still wouldn't be sure just by reading the man page that the clearing is
>> guaranteed whenever I don't get an error return value, though,
>
> I'm not quite sure what you want here. I mean: if there's an error,

I was just reiterating that the guarantee is not clear if you consider
all the statements in the man page.

> then the DONTNEED action didn't occur, right? Therefore, there won't
> be zero-filled pages. But, for what it's worth, I added "If the
> operation succeeds" at the start of that sentence beginning "Subsequent
> accesses...".

Yes, that should clarify it. Thanks!

> Now, some history, explaining why the page is a bit of a mess,
> and for that matter why I could really use more help on it from MM
> folk (especially in the form of actual patches [1], rather than notes
> about deficiencies in the documentation), because:
>
> ***I simply cannot keep up with all of the details***.

I see, and expected it would be like this. I would just send a patch if the
situation was clear, but here we should agree first, and I thought you should be
involved from the beginning.
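
For reference, the behaviour agreed on above (a successful MADV_DONTNEED on
memory that is not backed by a file gives zero-filled pages on the next
access) can be checked from userspace. The following is only a minimal
sketch, assuming Linux and a private anonymous mapping. The helper name
dontneed_zero_fills() is hypothetical and not taken from any library or from
the patch under discussion, and the sketch says nothing about locked,
hugetlb, PFNMAP or shmem mappings.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS and madvise() on glibc */
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: dirty a private anonymous mapping, drop it with
 * MADV_DONTNEED, then report what a later read sees.
 *
 * Returns  1 if madvise() succeeded and every byte reads back as zero,
 *          0 if madvise() succeeded but old data was still visible,
 *         -1 if setup or madvise() failed (errno is set, e.g. EINVAL).
 */
int dontneed_zero_fills(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0) {
        errno = EINVAL;
        return -1;
    }
    size_t len = npages * (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Fault the pages in and make them non-zero. */
    memset(p, 0xAA, len);

    /* Callers relying on zero-fill must check the return value. */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        int saved = errno;
        munmap(p, len);
        errno = saved;
        return -1;
    }

    /* Reading the range faults the pages back in. */
    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }

    munmap(p, len);
    return all_zero;
}
```

A caller that depends on the zero-fill would treat a return of -1 (for
example errno == EINVAL on a locked range) as "contents unchanged" and clear
the memory itself, which is roughly the check that jemalloc appears to make.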
> Once upon a time (Linux 2.4), there was madvise() with just 5 flags: > > MADV_NORMAL > MADV_RANDOM > MADV_SEQUENTIAL > MADV_WILLNEED > MADV_DONTNEED > > And already a dozen years ago, *I* added the text about MADV_DONTNEED. > Back then, I believe it was true. I'm not sure if it's still true now, > but I assume for the moment that it is, and await feedback. And the > text saying that the call does not affect the semantics of memory > access dates back even further (and was then true, MADV_DONTNEED aside). > > Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there > is a semantic mismatch between the destructive MADV_DONTNEED and > POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear > on most other implementations. > > Since the original implementation, numerous pieces of cruft^W^W^W > excellent new flags have been overloaded into this one system call. > Some of those certainly violated the "does not change the semantics > of the application" statement, but, sadly, the kernel developers who > implemented MADV_REMOVE or MADV_DONTFORK did not think to send a > patch to the man page for those new flags, one that might have noted > that the semantics of the application are changed by such flags. Equally > sadly, I did overlook to scan the bigger page when *I* added > documentation of these flags to those pages, otherwise I might have > caught that detail. > > So, just to repeat, I could really use more help on it from MM > folk in the form of actual patches to the man page. Thanks for the background. I'll try to remember to check for man-pages part when I review some api changing patch. > Thanks, > > Michael > > [1] https://www.kernel.org/doc/man-pages/patches.html > From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Michael Kerrisk (man-pages)" Subject: Re: MADV_DONTNEED semantics? 
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Date: Wed, 4 Feb 2015 15:00:59 +0100 Message-ID: References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> Reply-To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Return-path: In-Reply-To: <54D22298.3040504-AlSwsSmVLrQ@public.gmane.org> Sender: linux-man-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Vlastimil Babka Cc: "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins List-Id: linux-api@vger.kernel.org Hello Vlastimil, On 4 February 2015 at 14:46, Vlastimil Babka wrote: > On 02/03/2015 05:20 PM, Michael Kerrisk (man-pages) wrote: >> >> On 02/03/2015 12:42 PM, Vlastimil Babka wrote: >>> >>> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote: >>>> >>>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: >>>> >>>> It doesn't skip. It fails with -EINVAL. Or I miss something. >>> >>> >>> No, I missed that. Thanks for pointing out. The manpage also explains >>> EINVAL in >>> this case: >>> >>> * The application is attempting to release locked or shared pages (with >>> MADV_DONTNEED). >> >> Yes, there is that. But the page could be more explicit when discussing >> MADV_DONTNEED in the main text. I've done that. >> >>> - that covers mlocking ok, not sure if the rest fits the "shared pages" >>> case >>> though. I dont see any check for other kinds of shared pages in the code. >> >> Agreed. "shared" here seems confused. I've removed it. And I've >> added mention of "Huge TLB pages" for this error. > > Thanks. I also added those cases for MADV_REMOVE, BTW. >>>>> - The word "will result" did sound as a guarantee at least to me. 
So >>>>> here it >>>>> could be changed to "may result (unless the advice is ignored)"? >>>> >>>> It's too late to fix documentation. Applications already depends on the >>>> beheviour. >>> >>> Right, so as long as they check for EINVAL, it should be safe. It appears >>> that >>> jemalloc does. >> >> >> So, first a brief question: in the cases where the call does not error >> out, >> are we agreed that in the current implementation, MADV_DONTNEED will >> always result in zero-filled pages when the region is faulted back in >> (when we consider pages that are not backed by a file)? > > > I'd agree at this point. Thanks for the confirmation. > Also we should probably mention anonymously shared pages (shmem). I think > they behave the same as file here. You mean tmpfs here, right? (I don't keep all of the synonyms straight.) >>> I still wouldnt be sure just by reading the man page that the clearing is >>> guaranteed whenever I dont get an error return value, though, >> >> I'm not quite sure what you want here. I mean: if there's an error, > > I was just reiterating that the guarantee is not clear from if you consider > all the statements in the man page. > >> then the DONTNEED action didn't occur, right? Therefore, there won't >> be zero-filled pages. But, for what it's worth, I added "If the >> operation succeeds" at the start of that sentence beginning "Subsequent >> accesses...". > > Yes, that should clarify it. Thanks! Okay. >> Now, some history, explaining why the page is a bit of a mess, >> and for that matter why I could really use more help on it from MM >> folk (especially in the form of actual patches [1], rather than notes >> about deficiencies in the documentation), because: >> >> ***I simply cannot keep up with all of the details***. > > I see, and expected it would be like this. I would just send patch if the > situation was clear, but here we should agree first, and I thought you > should be involved from the beginning. 
Sorry -- I should have made it clearer, this statement was not targeted at you personally, or even necessarily at this particular thread. It was a general comment, that came up sharply to me as I looked at how much cruft there is in the madvise() page. >> Once upon a time (Linux 2.4), there was madvise() with just 5 flags: >> >> MADV_NORMAL >> MADV_RANDOM >> MADV_SEQUENTIAL >> MADV_WILLNEED >> MADV_DONTNEED >> >> And already a dozen years ago, *I* added the text about MADV_DONTNEED. >> Back then, I believe it was true. I'm not sure if it's still true now, >> but I assume for the moment that it is, and await feedback. And the >> text saying that the call does not affect the semantics of memory >> access dates back even further (and was then true, MADV_DONTNEED aside). >> >> Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there >> is a semantic mismatch between the destructive MADV_DONTNEED and >> POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear >> on most other implementations. >> >> Since the original implementation, numerous pieces of cruft^W^W^W >> excellent new flags have been overloaded into this one system call. >> Some of those certainly violated the "does not change the semantics >> of the application" statement, but, sadly, the kernel developers who >> implemented MADV_REMOVE or MADV_DONTFORK did not think to send a >> patch to the man page for those new flags, one that might have noted >> that the semantics of the application are changed by such flags. Equally >> sadly, I did overlook to scan the bigger page when *I* added >> documentation of these flags to those pages, otherwise I might have >> caught that detail. >> >> So, just to repeat, I could really use more help on it from MM >> folk in the form of actual patches to the man page. > > Thanks for the background. I'll try to remember to check for man-pages part > when I review some api changing patch. That would be great. 
Thanks, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe from this list: send the line "unsubscribe linux-man" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html From mboxrd@z Thu Jan 1 00:00:00 1970 From: Vlastimil Babka Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Date: Wed, 04 Feb 2015 18:02:02 +0100 Message-ID: <54D2508A.9030804@suse.cz> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org To: mtk.manpages@gmail.com Cc: "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins List-Id: linux-api@vger.kernel.org On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: > Hello Vlastimil, > > On 4 February 2015 at 14:46, Vlastimil Babka wrote: >>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" >>>> case >>>> though. I dont see any check for other kinds of shared pages in the code. >>> >>> Agreed. "shared" here seems confused. I've removed it. And I've >>> added mention of "Huge TLB pages" for this error. >> >> Thanks. > > I also added those cases for MADV_REMOVE, BTW. Right. There's also the following for MADV_REMOVE that needs updating: "Currently, only shmfs/tmpfs supports this; other filesystems return with the error ENOSYS." - it's not just shmem/tmpfs anymore. 
It should be best to refer to fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to date. - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is listed in the ERRORS section. >>>>>> - The word "will result" did sound as a guarantee at least to me. So >>>>>> here it >>>>>> could be changed to "may result (unless the advice is ignored)"? >>>>> >>>>> It's too late to fix documentation. Applications already depends on the >>>>> beheviour. >>>> >>>> Right, so as long as they check for EINVAL, it should be safe. It appears >>>> that >>>> jemalloc does. >>> >>> >>> So, first a brief question: in the cases where the call does not error >>> out, >>> are we agreed that in the current implementation, MADV_DONTNEED will >>> always result in zero-filled pages when the region is faulted back in >>> (when we consider pages that are not backed by a file)? >> >> >> I'd agree at this point. > > Thanks for the confirmation. > >> Also we should probably mention anonymously shared pages (shmem). I think >> they behave the same as file here. > > You mean tmpfs here, right? (I don't keep all of the synonyms straight.) shmem is tmpfs (that by itself would fit under "files" just fine), but also sys V segments created by shmget(2) and also mappings created by mmap with MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to refer to the full list. Thanks, Vlastimil -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Michael Kerrisk (man-pages)" Subject: Re: MADV_DONTNEED semantics? 
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Date: Wed, 4 Feb 2015 20:24:27 +0100 Message-ID: References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> Reply-To: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: In-Reply-To: <54D2508A.9030804-AlSwsSmVLrQ@public.gmane.org> Sender: linux-man-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Vlastimil Babka Cc: "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins List-Id: linux-api@vger.kernel.org On 4 February 2015 at 18:02, Vlastimil Babka wrote: > On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: >> >> Hello Vlastimil, >> >> On 4 February 2015 at 14:46, Vlastimil Babka wrote: >>>>> >>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" >>>>> case >>>>> though. I dont see any check for other kinds of shared pages in the >>>>> code. >>>> >>>> >>>> Agreed. "shared" here seems confused. I've removed it. And I've >>>> added mention of "Huge TLB pages" for this error. >>> >>> >>> Thanks. >> >> >> I also added those cases for MADV_REMOVE, BTW. > > > Right. There's also the following for MADV_REMOVE that needs updating: > > "Currently, only shmfs/tmpfs supports this; other filesystems return with > the error ENOSYS." > > - it's not just shmem/tmpfs anymore. It should be best to refer to > fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to > date. > > - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is > listed in the ERRORS section.
Yup, I recently added that as well, based on a patch from Jan Chaloupka. >>>>>>> - The word "will result" did sound as a guarantee at least to me. So >>>>>>> here it >>>>>>> could be changed to "may result (unless the advice is ignored)"? >>>>>> >>>>>> It's too late to fix documentation. Applications already depends on >>>>>> the >>>>>> beheviour. >>>>> >>>>> Right, so as long as they check for EINVAL, it should be safe. It >>>>> appears >>>>> that >>>>> jemalloc does. >>>> >>>> So, first a brief question: in the cases where the call does not error >>>> out, >>>> are we agreed that in the current implementation, MADV_DONTNEED will >>>> always result in zero-filled pages when the region is faulted back in >>>> (when we consider pages that are not backed by a file)? >>> >>> I'd agree at this point. >> >> Thanks for the confirmation. >> >>> Also we should probably mention anonymously shared pages (shmem). I think >>> they behave the same as file here. >> >> You mean tmpfs here, right? (I don't keep all of the synonyms straight.) > > shmem is tmpfs (that by itself would fit under "files" just fine), but also > sys V segments created by shmget(2) and also mappings created by mmap with > MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to > refer to the full list. So, how about this text: After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either reloading of the memory contents from the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings.
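As a rough illustration of the wording proposed above (a sketch only, not part of the man page text and not code posted in this thread), the difference between private and shared anonymous mappings can be observed from userspace. The helper name byte_after_dontneed() is made up for this example; it assumes a Linux system where mmap(2) with MAP_ANONYMOUS and madvise(2) behave as described in the discussion.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* for madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one anonymous page with the given visibility
 * (MAP_PRIVATE or MAP_SHARED), fill it with 0xAA, apply MADV_DONTNEED,
 * and return the first byte as seen when the page is touched again.
 * Returns -1 if sysconf(), mmap() or madvise() fails.
 */
static int byte_after_dontneed(int visibility)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            visibility | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len);                       /* dirty the page */

    int result = -1;
    if (madvise(p, len, MADV_DONTNEED) == 0)    /* only trust the result on success */
        result = *(volatile unsigned char *)p;  /* fault the page back in */

    munmap(p, len);
    return result;
}
```

Called with MAP_PRIVATE, the helper is expected to see a zero-filled page; called with MAP_SHARED, it is expected to see the old 0xAA contents again, because the shmem object backing the mapping still holds the data. The return value is only meaningful when madvise() succeeded, which matches the "after a successful MADV_DONTNEED operation" qualifier in the proposed text.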
Thanks, Michael -- To unsubscribe from this list: send the line "unsubscribe linux-man" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html From mboxrd@z Thu Jan 1 00:00:00 1970 From: Minchan Kim Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Date: Thu, 5 Feb 2015 10:07:58 +0900 Message-ID: <20150205010757.GA20996@blaptop> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Michael Kerrisk (man-pages)" Cc: Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins List-Id: linux-api@vger.kernel.org Hello, On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote: > On 4 February 2015 at 18:02, Vlastimil Babka wrote: > > On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: > >> > >> Hello Vlastimil, > >> > >> On 4 February 2015 at 14:46, Vlastimil Babka wrote: > >>>>> > >>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" > >>>>> case > >>>>> though. I dont see any check for other kinds of shared pages in the > >>>>> code. > >>>> > >>>> > >>>> Agreed. "shared" here seems confused. I've removed it. And I've > >>>> added mention of "Huge TLB pages" for this error. > >>> > >>> > >>> Thanks. > >> > >> > >> I also added those cases for MADV_REMOVE, BTW. > > > > > > Right.
There's also the following for MADV_REMOVE that needs updating: > > > > "Currently, only shmfs/tmpfs supports this; other filesystems return with > > the error ENOSYS." > > > > - it's not just shmem/tmpfs anymore. It should be best to refer to > > fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to > > date. > > > > - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is > > listed in the ERRORS section. > > Yup, I recently added that as well, based on a patch from Jan Chaloupka. > > >>>>>>> - The word "will result" did sound as a guarantee at least to me. So > >>>>>>> here it > >>>>>>> could be changed to "may result (unless the advice is ignored)"? > >>>>>> > >>>>>> It's too late to fix documentation. Applications already depends on > >>>>>> the > >>>>>> beheviour. > >>>>> > >>>>> Right, so as long as they check for EINVAL, it should be safe. It > >>>>> appears > >>>>> that > >>>>> jemalloc does. > >>>> > >>>> So, first a brief question: in the cases where the call does not error > >>>> out, > >>>> are we agreed that in the current implementation, MADV_DONTNEED will > >>>> always result in zero-filled pages when the region is faulted back in > >>>> (when we consider pages that are not backed by a file)? > >>> > >>> I'd agree at this point. > >> > >> Thanks for the confirmation. > >> > >>> Also we should probably mention anonymously shared pages (shmem). I think > >>> they behave the same as file here. > >> > >> You mean tmpfs here, right? (I don't keep all of the synonyms straight.) > > > > shmem is tmpfs (that by itself would fit under "files" just fine), but also > > sys V segments created by shmget(2) and also mappings created by mmap with > > MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to > > refer to the full list.
> > So, how about this text: > > After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are > changed: subsequent accesses of pages in the range > will succeed, but will result in either reloading of > the memory contents from the underlying mapped file > (for shared file mappings, shared anonymous mappings, > and shmem-based techniques such as System V shared > memory segments) or zero-fill-on-demand pages for > anonymous private mappings. Hmm, I'd like to clarify. Whether it was intention or not, some of userspace developers thought about that syscall drop pages instantly if was no-error return so that they will see more free pages(ie, rss for the process will be decreased) with keeping the VMA. Can we rely on it? And we should make error section, too. "locked" covers mlock(2) and you said you will add hugetlb. Then, VM_PFNMAP? In that case, it fails. How can we say about VM_PFNMAP? special mapping for some drivers? One more thing, "The kernel is free to ignore the advice". It conflicts "This call does not influence the semantics of the application (except in the case of MADV_DONTNEED)" so is it okay we can believe "The kernel is free to ingmore the advise except MADV_DONTNEED"? Thanks. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: MADV_DONTNEED semantics?
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Date: Thu, 5 Feb 2015 16:41:02 +0100 Message-ID: <20150205154102.GA20607@dhcp22.suse.cz> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Michael Kerrisk (man-pages)" Cc: Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins List-Id: linux-api@vger.kernel.org On Wed 04-02-15 20:24:27, Michael Kerrisk wrote: [...] > So, how about this text: > > After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are > changed: subsequent accesses of pages in the range > will succeed, but will result in either reloading of > the memory contents from the underlying mapped file " result in either providing the up-to-date contents of the underlying mapped file " Would be more precise IMO because reload might be interpreted as a major fault which is not necessarily the case (see below). > (for shared file mappings, shared anonymous mappings, > and shmem-based techniques such as System V shared > memory segments) or zero-fill-on-demand pages for > anonymous private mappings. Yes, this wording is better because many users are not aware of MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't mention that. I am just wondering whether it makes sense to mention that MADV_DONTNEED for shared mappings might be surprising and not freeing the backing pages thus not really freeing memory until there is a memory pressure.
But maybe this is too implementation specific for a man page. What about the following wording on top of yours? " Please note that the MADV_DONTNEED hint on shared mappings might not lead to immediate freeing of pages in the range. The kernel is free to delay this until an appropriate moment. RSS of the calling process will be reduced however. " -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Michael Kerrisk (man-pages)" Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Date: Fri, 06 Feb 2015 16:41:12 +0100 Message-ID: <54D4E098.8050004@gmail.com> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205010757.GA20996@blaptop> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: In-Reply-To: <20150205010757.GA20996@blaptop> Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Minchan Kim Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, Vlastimil Babka , "Kirill A.
Shutemov" , Dave Hansen , Mel Gorman , "linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org" , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins List-Id: linux-api@vger.kernel.org On 02/05/2015 02:07 AM, Minchan Kim wrote: > Hello, > > On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote: >> On 4 February 2015 at 18:02, Vlastimil Babka wrote: >>> On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: >>>> >>>> Hello Vlastimil, >>>> >>>> On 4 February 2015 at 14:46, Vlastimil Babka wrote: >>>>>>> >>>>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" >>>>>>> case >>>>>>> though. I dont see any check for other kinds of shared pages in the >>>>>>> code. >>>>>> >>>>>> >>>>>> Agreed. "shared" here seems confused. I've removed it. And I've >>>>>> added mention of "Huge TLB pages" for this error. >>>>> >>>>> >>>>> Thanks. >>>> >>>> >>>> I also added those cases for MADV_REMOVE, BTW. >>> >>> >>> Right. There's also the following for MADV_REMOVE that needs updating: >>> >>> "Currently, only shmfs/tmpfs supports this; other filesystems return with >>> the error ENOSYS." >>> >>> - it's not just shmem/tmpfs anymore. It should be best to refer to >>> fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to >>> date. >>> >>> - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is >>> listed in the ERRORS section. >> >> Yup, I recently added that as well, based on a patch from Jan Chaloupka. >> >>>>>>>>> - The word "will result" did sound as a guarantee at least to me. So >>>>>>>>> here it >>>>>>>>> could be changed to "may result (unless the advice is ignored)"? >>>>>>>> >>>>>>>> It's too late to fix documentation. Applications already depends on >>>>>>>> the >>>>>>>> beheviour. >>>>>>> >>>>>>> Right, so as long as they check for EINVAL, it should be safe. It >>>>>>> appears >>>>>>> that >>>>>>> jemalloc does.
>>>>>> >>>>>> So, first a brief question: in the cases where the call does not error >>>>>> out, >>>>>> are we agreed that in the current implementation, MADV_DONTNEED will >>>>>> always result in zero-filled pages when the region is faulted back in >>>>>> (when we consider pages that are not backed by a file)? >>>>> >>>>> I'd agree at this point. >>>> >>>> Thanks for the confirmation. >>>> >>>>> Also we should probably mention anonymously shared pages (shmem). I think >>>>> they behave the same as file here. >>>> >>>> You mean tmpfs here, right? (I don't keep all of the synonyms straight.) >>> >>> shmem is tmpfs (that by itself would fit under "files" just fine), but also >>> sys V segments created by shmget(2) and also mappings created by mmap with >>> MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to >>> refer to the full list. >> >> So, how about this text: >> >> After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are >> changed: subsequent accesses of pages in the range >> will succeed, but will result in either reloading of >> the memory contents from the underlying mapped file >> (for shared file mappings, shared anonymous mappings, >> and shmem-based techniques such as System V shared >> memory segments) or zero-fill-on-demand pages for >> anonymous private mappings. > > Hmm, I'd like to clarify. > > Whether it was intention or not, some of userspace developers thought > about that syscall drop pages instantly if was no-error return so that > they will see more free pages(ie, rss for the process will be decreased) > with keeping the VMA. Can we rely on it? I do not know. Michael? > And we should make error section, too. > "locked" covers mlock(2) and you said you will add hugetlb. Then, > VM_PFNMAP? In that case, it fails. How can we say about VM_PFNMAP? > special mapping for some drivers? I'm open for offers on what to add.
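For readers who want to see the "locked" error case mentioned above, here is a minimal sketch (an illustration added to this archive, not code from the thread). The helper name dontneed_errno_on_locked_page() is hypothetical. It assumes the process is allowed to mlock(2) a single page, which depends on RLIMIT_MEMLOCK; when that setup fails, the helper reports -1 rather than guessing.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE            /* for madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: lock one private anonymous page with mlock(2),
 * then try MADV_DONTNEED on it.
 * Returns the errno left by madvise() (0 if the call succeeds),
 * or -1 if the page could not be mapped or locked in the first place.
 */
static int dontneed_errno_on_locked_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (mlock(p, len) != 0) {      /* e.g. RLIMIT_MEMLOCK too small */
        munmap(p, len);
        return -1;
    }

    errno = 0;
    int rc = madvise(p, len, MADV_DONTNEED);
    int err = (rc == 0) ? 0 : errno;   /* save errno before cleanup calls */

    munlock(p, len);
    munmap(p, len);
    return err;
}
```

On a kernel that behaves as described in this thread, the helper is expected to return EINVAL. Callers that depend on the zero-fill behaviour would check the madvise() return value in the same way before assuming that the range was actually dropped.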
> One more thing, "The kernel is free to ignore the advice". > It conflicts "This call does not influence the semantics of the > application (except in the case of MADV_DONTNEED)" so > is it okay we can believe "The kernel is free to ingmore the advise > except MADV_DONTNEED"? I decided to just drop the sentence The kernel is free to ignore the advice. It creates misunderstandings, and does not really add information. Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Michael Kerrisk (man-pages)" Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Date: Fri, 06 Feb 2015 16:57:50 +0100 Message-ID: <54D4E47E.4020509@gmail.com> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205154102.GA20607@dhcp22.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20150205154102.GA20607@dhcp22.suse.cz> Sender: owner-linux-mm@kvack.org To: Michal Hocko Cc: mtk.manpages@gmail.com, Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins List-Id: linux-api@vger.kernel.org Hi Michael On 02/05/2015 04:41 PM, Michal Hocko wrote: > On Wed 04-02-15 20:24:27, Michael Kerrisk wrote: > [...]
>> So, how about this text: >> >> After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are >> changed: subsequent accesses of pages in the range >> will succeed, but will result in either reloading of >> the memory contents from the underlying mapped file > > " > result in either providing the up-to-date contents of the underlying > mapped file > " Thanks! I did something like that. See below. > Would be more precise IMO because reload might be interpreted as a major > fault which is not necessarily the case (see below). > >> (for shared file mappings, shared anonymous mappings, >> and shmem-based techniques such as System V shared >> memory segments) or zero-fill-on-demand pages for >> anonymous private mappings. > > Yes, this wording is better because many users are not aware of > MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't > mention that. (Michal, would you have a text to propose to add to the mmap(2) page? Maybe it would be useful to add something there.) > > I am just wondering whether it makes sense to mention that MADV_DONTNEED > for shared mappings might be surprising and not freeing the backing > pages thus not really freeing memory until there is a memory > pressure. But maybe this is too implementation specific for a man > page. What about the following wording on top of yours? > " > Please note that the MADV_DONTNEED hint on shared mappings might not > lead to immediate freeing of pages in the range. The kernel is free to > delay this until an appropriate moment. RSS of the calling process will > be reduced however. > " Thanks! I added this, but dropped in the word "immediately" in the last sentence, since I assume that was implied.
So now we have: After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings. Note that, when applied to shared mappings, MADV_DONTNEED might not lead to immediate freeing of the pages in the range. The kernel is free to delay freeing the pages until an appropriate moment. The resident set size (RSS) of the calling process will be immediately reduced however. The current draft of the page can be found in a branch, http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise Thanks, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michal Hocko Subject: Re: MADV_DONTNEED semantics?
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Fri, 6 Feb 2015 21:45:36 +0100
Message-ID: <20150206204536.GA24245@dhcp22.suse.cz>
References: <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205154102.GA20607@dhcp22.suse.cz> <54D4E47E.4020509@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Content-Disposition: inline
In-Reply-To: <54D4E47E.4020509-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: "Michael Kerrisk (man-pages)"
Cc: Vlastimil Babka, "Kirill A. Shutemov", Dave Hansen, Mel Gorman, "linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org", Minchan Kim, Andrew Morton, lkml, Linux API, linux-man, Hugh Dickins
List-Id: linux-api@vger.kernel.org

On Fri 06-02-15 16:57:50, Michael Kerrisk wrote:
[...]
> > Yes, this wording is better because many users are not aware of
> > MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't
> > mention that.
>
> (Michal, would you have a text to propose to add to the mmap(2) page?
> Maybe it would be useful to add something there.)

I am half way on vacation, but I can cook a patch after I am back after
week.

> > I am just wondering whether it makes sense to mention that MADV_DONTNEED
> > for shared mappings might be surprising and not freeing the backing
> > pages thus not really freeing memory until there is a memory
> > pressure. But maybe this is too implementation specific for a man
> > page. What about the following wording on top of yours?
> > "
> > Please note that the MADV_DONTNEED hint on shared mappings might not
> > lead to immediate freeing of pages in the range. The kernel is free to
> > delay this until an appropriate moment. RSS of the calling process will
> > be reduced however.
> > "
>
> Thanks! I added this, but dropped in the word "immediately" in the last
> sentence, since I assume that was implied. So now we have:
>
>     After a successful MADV_DONTNEED operation, the semantics
>     of memory access in the specified region are
>     changed: subsequent accesses of pages in the range will
>     succeed, but will result in either repopulating the memory
>     contents from the up-to-date contents of the underlying
>     mapped file (for shared file mappings, shared
>     anonymous mappings, and shmem-based techniques such as
>     System V shared memory segments) or zero-fill-on-demand
>     pages for anonymous private mappings.
>
>     Note that, when applied to shared mappings, MADV_DONTNEED
>     might not lead to immediate freeing of the pages in
>     the range. The kernel is free to delay freeing the
>     pages until an appropriate moment. The resident set
>     size (RSS) of the calling process will be immediately
>     reduced however.

This sounds good to me and it is definitely much better than the current
state. Thanks!

> The current draft of the page can be found in a branch,
> http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise
>
> Thanks,
>
> Michael
>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

--
Michal Hocko
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Minchan Kim
Subject: Re: MADV_DONTNEED semantics?
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Mon, 9 Feb 2015 15:46:00 +0900
Message-ID: <20150209064600.GA32300@blaptop>
References: <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205010757.GA20996@blaptop> <54D4E098.8050004@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Content-Disposition: inline
In-Reply-To: <54D4E098.8050004-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: "Michael Kerrisk (man-pages)"
Cc: Vlastimil Babka, "Kirill A. Shutemov", Dave Hansen, Mel Gorman, "linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org", Andrew Morton, lkml, Linux API, linux-man, Hugh Dickins
List-Id: linux-api@vger.kernel.org

Hello, Michael

On Fri, Feb 06, 2015 at 04:41:12PM +0100, Michael Kerrisk (man-pages) wrote:
> On 02/05/2015 02:07 AM, Minchan Kim wrote:
> > Hello,
> >
> > On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote:
> >> On 4 February 2015 at 18:02, Vlastimil Babka wrote:
> >>> On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote:
> >>>>
> >>>> Hello Vlastimil,
> >>>>
> >>>> On 4 February 2015 at 14:46, Vlastimil Babka wrote:
> >>>>>>>
> >>>>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages"
> >>>>>>> case
> >>>>>>> though. I dont see any check for other kinds of shared pages in the
> >>>>>>> code.
> >>>>>>
> >>>>>>
> >>>>>> Agreed. "shared" here seems confused. I've removed it. And I've
> >>>>>> added mention of "Huge TLB pages" for this error.
> >>>>>
> >>>>>
> >>>>> Thanks.
> >>>>
> >>>>
> >>>> I also added those cases for MADV_REMOVE, BTW.
> >>>
> >>>
> >>> Right. There's also the following for MADV_REMOVE that needs updating:
> >>>
> >>> "Currently, only shmfs/tmpfs supports this; other filesystems return with
> >>> the error ENOSYS."
> >>>
> >>> - it's not just shmem/tmpfs anymore. It should be best to refer to
> >>> fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to
> >>> date.
> >>>
> >>> - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is
> >>> listed in the ERRORS section.
> >>
> >> Yup, I recently added that as well, based on a patch from Jan Chaloupka.
> >>
> >>>>>>>>> - The word "will result" did sound as a guarantee at least to me. So
> >>>>>>>>> here it
> >>>>>>>>> could be changed to "may result (unless the advice is ignored)"?
> >>>>>>>>
> >>>>>>>> It's too late to fix documentation. Applications already depends on
> >>>>>>>> the
> >>>>>>>> beheviour.
> >>>>>>>
> >>>>>>> Right, so as long as they check for EINVAL, it should be safe. It
> >>>>>>> appears
> >>>>>>> that
> >>>>>>> jemalloc does.
> >>>>>>
> >>>>>> So, first a brief question: in the cases where the call does not error
> >>>>>> out,
> >>>>>> are we agreed that in the current implementation, MADV_DONTNEED will
> >>>>>> always result in zero-filled pages when the region is faulted back in
> >>>>>> (when we consider pages that are not backed by a file)?
> >>>>>
> >>>>> I'd agree at this point.
> >>>>
> >>>> Thanks for the confirmation.
> >>>>
> >>>>> Also we should probably mention anonymously shared pages (shmem). I think
> >>>>> they behave the same as file here.
> >>>>
> >>>> You mean tmpfs here, right? (I don't keep all of the synonyms straight.)
> >>>
> >>> shmem is tmpfs (that by itself would fit under "files" just fine), but also
> >>> sys V segments created by shmget(2) and also mappings created by mmap with
> >>> MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to
> >>> refer to the full list.
> >>
> >> So, how about this text:
> >>
> >>     After a successful MADV_DONTNEED operation, the semantics
> >>     of memory access in the specified region are
> >>     changed: subsequent accesses of pages in the range
> >>     will succeed, but will result in either reloading of
> >>     the memory contents from the underlying mapped file
> >>     (for shared file mappings, shared anonymous mappings,
> >>     and shmem-based techniques such as System V shared
> >>     memory segments) or zero-fill-on-demand pages for
> >>     anonymous private mappings.
> >
> > Hmm, I'd like to clarify.
> >
> > Whether it was intention or not, some of userspace developers thought
> > about that syscall drop pages instantly if was no-error return so that
> > they will see more free pages(ie, rss for the process will be decreased)
> > with keeping the VMA. Can we rely on it?
>
> I do not know. Michael?

It's important to identify difference between MADV_DONTNEED and MADV_FREE
so it would be better to clear out in this chance.

>
> > And we should make error section, too.
> > "locked" covers mlock(2) and you said you will add hugetlb. Then,
> > VM_PFNMAP? In that case, it fails. How can we say about VM_PFNMAP?
> > special mapping for some drivers?
>
> I'm open for offers on what to add.

I suggests from quote "LWN" http://lwn.net/Articles/162860/
"*special mapping* which is not made up of "normal" pages.
It is usually created by device drivers which map special memory areas
into user space"

>
> > One more thing, "The kernel is free to ignore the advice".
> > It conflicts "This call does not influence the semantics of the
> > application (except in the case of MADV_DONTNEED)" so
> > is it okay we can believe "The kernel is free to ingmore the advise
> > except MADV_DONTNEED"?
>
> I decided to just drop the sentence
>
>     The kernel is free to ignore the advice.
>
> It creates misunderstandings, and does not really add information.

Sounds good.
>
> Cheers,
>
> Michael
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

--
Kind regards,
Minchan Kim

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Minchan Kim
Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Mon, 9 Feb 2015 15:50:45 +0900
Message-ID: <20150209065045.GB32300@blaptop>
References: <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205154102.GA20607@dhcp22.suse.cz> <54D4E47E.4020509@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Content-Disposition: inline
In-Reply-To: <54D4E47E.4020509-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Sender: linux-man-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: "Michael Kerrisk (man-pages)"
Cc: Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov", Dave Hansen, Mel Gorman, "linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org", Andrew Morton, lkml, Linux API, linux-man, Hugh Dickins
List-Id: linux-api@vger.kernel.org

On Fri, Feb 06, 2015 at 04:57:50PM +0100, Michael Kerrisk (man-pages) wrote:
> Hi Michael
>
> On 02/05/2015 04:41 PM, Michal Hocko wrote:
> > On Wed 04-02-15 20:24:27, Michael Kerrisk wrote:
> > [...]
> >> So, how about this text:
> >>
> >>     After a successful MADV_DONTNEED operation, the semantics
> >>     of memory access in the specified region are
> >>     changed: subsequent accesses of pages in the range
> >>     will succeed, but will result in either reloading of
> >>     the memory contents from the underlying mapped file
> >
> > "
> > result in either providing the up-to-date contents of the underlying
> > mapped file
> > "
>
> Thanks! I did something like that. See below.
>
> > Would be more precise IMO because reload might be interpreted as a major
> > fault which is not necessarily the case (see below).
> >
> >>     (for shared file mappings, shared anonymous mappings,
> >>     and shmem-based techniques such as System V shared
> >>     memory segments) or zero-fill-on-demand pages for
> >>     anonymous private mappings.
> >
> > Yes, this wording is better because many users are not aware of
> > MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't
> > mention that.
>
> (Michal, would you have a text to propose to add to the mmap(2) page?
> Maybe it would be useful to add something there.)
>
> >
> > I am just wondering whether it makes sense to mention that MADV_DONTNEED
> > for shared mappings might be surprising and not freeing the backing
> > pages thus not really freeing memory until there is a memory
> > pressure. But maybe this is too implementation specific for a man
> > page. What about the following wording on top of yours?
> > "
> > Please note that the MADV_DONTNEED hint on shared mappings might not
> > lead to immediate freeing of pages in the range. The kernel is free to
> > delay this until an appropriate moment. RSS of the calling process will
> > be reduced however.
> > "
>
> Thanks! I added this, but dropped in the word "immediately" in the last
> sentence, since I assume that was implied. So now we have:
>
>     After a successful MADV_DONTNEED operation, the semantics
>     of memory access in the specified region are
>     changed: subsequent accesses of pages in the range will
>     succeed, but will result in either repopulating the memory
>     contents from the up-to-date contents of the underlying
>     mapped file (for shared file mappings, shared
>     anonymous mappings, and shmem-based techniques such as
>     System V shared memory segments) or zero-fill-on-demand
>     pages for anonymous private mappings.
>
>     Note that, when applied to shared mappings, MADV_DONTNEED
>     might not lead to immediate freeing of the pages in
>     the range. The kernel is free to delay freeing the
>     pages until an appropriate moment. The resident set
>     size (RSS) of the calling process will be immediately
>     reduced however.

Looks good. So, I can parse it that anonymous private mappings will lead
to immediate freeing of the pages in the range so it's clearly different
with MADV_FREE.

>
> The current draft of the page can be found in a branch,
> http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise
>
> Thanks,
>
> Michael
>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Michael Kerrisk (man-pages)"
Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Date: Mon, 09 Feb 2015 10:13:07 +0100
Message-ID: <54D87A23.40703@gmail.com>
References: <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205010757.GA20996@blaptop> <54D4E098.8050004@gmail.com> <20150209064600.GA32300@blaptop>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20150209064600.GA32300@blaptop>
Sender: owner-linux-mm@kvack.org
To: Minchan Kim
Cc: mtk.manpages@gmail.com, Vlastimil Babka, "Kirill A.
Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins List-Id: linux-api@vger.kernel.org Hello Minchan On 02/09/2015 07:46 AM, Minchan Kim wrote: > Hello, Michael > > On Fri, Feb 06, 2015 at 04:41:12PM +0100, Michael Kerrisk (man-pages) wrote: >> On 02/05/2015 02:07 AM, Minchan Kim wrote: >>> Hello, >>> >>> On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote: >>>> On 4 February 2015 at 18:02, Vlastimil Babka wrote: >>>>> On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: [...] >>> And we should make error section, too. >>> "locked" covers mlock(2) and you said you will add hugetlb. Then, >>> VM_PFNMAP? In that case, it fails. How can we say about VM_PFNMAP? >>> special mapping for some drivers? >> >> I'm open for offers on what to add. > > I suggests from quote "LWN" http://lwn.net/Articles/162860/ > "*special mapping* which is not made up of "normal" pages. > It is usually created by device drivers which map special memory areas > into user space" Thanks. I've added mention of VM_PFNMAP in the discussion of both MADV_DONTNEED and MADV_REMOVE, and noted that both of those operations will give an error when applied to VM_PFNMAP pages. Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f180.google.com (mail-we0-f180.google.com [74.125.82.180]) by kanga.kvack.org (Postfix) with ESMTP id C2C926B0032 for ; Mon, 2 Feb 2015 11:55:31 -0500 (EST) Received: by mail-we0-f180.google.com with SMTP id m14so40223214wev.11 for ; Mon, 02 Feb 2015 08:55:31 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id ft7si38145686wjb.169.2015.02.02.08.55.29 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 02 Feb 2015 08:55:29 -0800 (PST) Date: Mon, 2 Feb 2015 16:55:25 +0000 From: Mel Gorman Subject: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150202165525.GM2395@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Minchan Kim , Vlastimil Babka , Andrew Morton , linux-kernel@vger.kernel.org glibc malloc changed behaviour in glibc 2.10 to have per-thread arenas instead of creating new areans if the existing ones were contended. The decision appears to have been made so the allocator scales better but the downside is that madvise(MADV_DONTNEED) is now called for these per-thread areans during free. This tears down pages that would have previously remained. There is nothing wrong with this decision from a functional point of view but any threaded application that frequently allocates/frees the same-sized region is going to incur the full teardown and refault costs. This is extremely obvious in the ebizzy benchmark. At its core, threads are frequently freeing and allocating buffers of the same size. It is much faster on distributions with older versions of glibc. Profiles showed that a large amount of system CPU time was spent on tearing down and refaulting pages. This patch identifies when a thread is frequently calling MADV_DONTNEED on the same region of memory and starts ignoring the hint. On an 8-core single-socket machine this was the impact on ebizzy using glibc 2.19. 
ebizzy Overall Throughput
                      3.19.0-rc6              3.19.0-rc6
                         vanilla            madvise-v1r1
Hmean Rsec-1      12619.93 (  0.00%)    34807.02 (175.81%)
Hmean Rsec-3      33434.19 (  0.00%)   100733.77 (201.29%)
Hmean Rsec-5      45796.68 (  0.00%)   134257.34 (193.16%)
Hmean Rsec-7      53146.93 (  0.00%)   145512.85 (173.79%)
Hmean Rsec-12     55132.87 (  0.00%)   145560.86 (164.02%)
Hmean Rsec-18     54846.52 (  0.00%)   145120.79 (164.59%)
Hmean Rsec-24     54368.95 (  0.00%)   142733.89 (162.53%)
Hmean Rsec-30     54388.86 (  0.00%)   141424.09 (160.02%)
Hmean Rsec-32     54047.11 (  0.00%)   139151.76 (157.46%)

And the system CPU usage was also much reduced

           3.19.0-rc6    3.19.0-rc6
              vanilla  madvise-v1r1
User          2647.19       8347.26
System        5742.90         42.42
Elapsed       1350.60       1350.65

It's even more ridiculous on a 4 socket machine

ebizzy Overall Throughput
                      3.19.0-rc6              3.19.0-rc6
                         vanilla            madvise-v1r1
Hmean Rsec-1       5354.37 (  0.00%)    12838.61 ( 139.78%)
Hmean Rsec-4      10338.41 (  0.00%)    50514.52 ( 388.61%)
Hmean Rsec-7       7766.33 (  0.00%)    88555.30 (1040.25%)
Hmean Rsec-12      7188.40 (  0.00%)   154180.78 (2044.86%)
Hmean Rsec-21      7001.82 (  0.00%)   266555.51 (3706.95%)
Hmean Rsec-30      8975.08 (  0.00%)   314369.88 (3402.70%)
Hmean Rsec-48     12136.53 (  0.00%)   358525.74 (2854.10%)
Hmean Rsec-79     12607.37 (  0.00%)   341646.49 (2609.89%)
Hmean Rsec-110    12563.37 (  0.00%)   338058.65 (2590.83%)
Hmean Rsec-141    11701.85 (  0.00%)   331255.78 (2730.80%)
Hmean Rsec-172    10987.39 (  0.00%)   312003.62 (2739.65%)
Hmean Rsec-192    12050.46 (  0.00%)   296401.88 (2359.67%)

           3.19.0-rc6    3.19.0-rc6
              vanilla  madvise-v1r1
User          4136.44      53506.65
System       50262.68        906.49
Elapsed       1802.07       1801.99

Note in both cases that the elapsed time is similar because the benchmark
is configured to run for a fixed duration.

MADV_FREE would have a lower cost if the underlying allocator used it but
there is no guarantee that allocators will use it. Arguably the kernel has
no business preventing an application developer shooting themselves in a
foot but this is a case where it's relatively easy to detect the bad
behaviour and avoid it.
Signed-off-by: Mel Gorman
---
 fs/exec.c             |  4 ++++
 include/linux/sched.h |  5 +++++
 kernel/fork.c         |  5 +++++
 mm/madvise.c          | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 70 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index ad8798e26be9..5c691fcc32f4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1551,6 +1551,10 @@ static int do_execveat_common(int fd, struct filename *filename,
 	current->in_execve = 0;
 	acct_update_integrals(current);
 	task_numa_free(current);
+	if (current->madvise_state) {
+		kfree(current->madvise_state);
+		current->madvise_state = NULL;
+	}
 	free_bprm(bprm);
 	kfree(pathbuf);
 	putname(filename);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef98d2f..b6706bdb27fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1271,6 +1271,9 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };

+/* mm/madvise.c */
+struct madvise_state_info;
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1637,6 +1640,8 @@ struct task_struct {

 	struct page_frag task_frag;

+	struct madvise_state_info *madvise_state;
+
 #ifdef CONFIG_TASK_DELAY_ACCT
 	struct task_delay_info *delays;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2ddade9f1..6d8dd1379240 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -246,6 +246,11 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);

+	if (current->madvise_state) {
+		kfree(current->madvise_state);
+		current->madvise_state = NULL;
+	}
+
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index a271adc93289..907bb0922711 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include

 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -251,6 +252,57 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }

+#define MADVISE_HASH		VMACACHE_HASH
+#define MADVISE_STATE_SIZE	VMACACHE_SIZE
+#define MADVISE_THRESHOLD	8
+
+struct madvise_state_info {
+	unsigned long start;
+	unsigned long end;
+	int count;
+	unsigned long jiffies;
+};
+
+/* Returns true if userspace is continually dropping the same address range */
+static bool ignore_madvise_hint(unsigned long start, unsigned long end)
+{
+	int i;
+
+	if (!current->madvise_state)
+		current->madvise_state = kzalloc(sizeof(struct madvise_state_info) * MADVISE_STATE_SIZE, GFP_KERNEL);
+	if (!current->madvise_state)
+		return false;
+
+	i = VMACACHE_HASH(start);
+	if (current->madvise_state[i].start != start ||
+	    current->madvise_state[i].end != end) {
+		/* cache miss */
+		current->madvise_state[i].start = start;
+		current->madvise_state[i].end = end;
+		current->madvise_state[i].count = 0;
+		current->madvise_state[i].jiffies = jiffies;
+	} else {
+		/* cache hit */
+		unsigned long reset = current->madvise_state[i].jiffies + HZ;
+		if (time_after(jiffies, reset)) {
+			/*
+			 * If it is a second since the last madvise on this
+			 * range or since madvise hints got ignored then reset
+			 * the counts and apply the hint again.
+			 */
+			current->madvise_state[i].count = 0;
+			current->madvise_state[i].jiffies = jiffies;
+		} else
+			current->madvise_state[i].count++;
+
+		if (current->madvise_state[i].count > MADVISE_THRESHOLD)
+			return true;
+		current->madvise_state[i].jiffies = jiffies;
+	}
+
+	return false;
+}
+
 /*
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
@@ -278,6 +330,10 @@ static long madvise_dontneed(struct vm_area_struct *vma,
 	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
 		return -EINVAL;

+	/* Ignore hint if madvise is continually dropping the same range */
+	if (ignore_madvise_hint(start, end))
+		return 0;
+
 	if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
 		struct zap_details details = {
 			.nonlinear_vma = vma,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f54.google.com (mail-pa0-f54.google.com [209.85.220.54]) by kanga.kvack.org (Postfix) with ESMTP id 835176B0038 for ; Mon, 2 Feb 2015 17:05:08 -0500 (EST)
Received: by mail-pa0-f54.google.com with SMTP id eu11so87609635pac.13 for ; Mon, 02 Feb 2015 14:05:08 -0800 (PST)
Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id b11si99257pdm.95.2015.02.02.14.05.07 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 02 Feb 2015 14:05:07 -0800 (PST)
Date: Mon, 2 Feb 2015 14:05:06 -0800
From: Andrew Morton
Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Message-Id: <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org>
In-Reply-To: <20150202165525.GM2395@suse.de>
References: <20150202165525.GM2395@suse.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: linux-mm@kvack.org, Minchan Kim, Vlastimil Babka, linux-kernel@vger.kernel.org

On Mon, 2 Feb 2015 16:55:25 +0000 Mel Gorman wrote:

> glibc malloc changed behaviour in glibc 2.10 to have per-thread arenas
> instead of creating new areans if the existing ones were contended.
> The decision appears to have been made so the allocator scales better but the > downside is that madvise(MADV_DONTNEED) is now called for these per-thread > areans during free. This tears down pages that would have previously > remained. There is nothing wrong with this decision from a functional point > of view but any threaded application that frequently allocates/frees the > same-sized region is going to incur the full teardown and refault costs. MADV_DONTNEED has been there for many years. How could this problem not have been noticed during glibc 2.10 development/testing? Is there some more recent kernel change which is triggering this? > This patch identifies when a thread is frequently calling MADV_DONTNEED > on the same region of memory and starts ignoring the hint. That's pretty nasty-looking :( And presumably there are all sorts of behaviours which will still trigger the problem but which will avoid the start/end equality test in ignore_madvise_hint()? Really, this is a glibc problem and only a glibc problem. MADV_DONTNEED is unavoidably expensive and glibc is calling MADV_DONTNEED for a region which it *does* need. Is there something preventing this from being addressed within glibc? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com [74.125.82.52]) by kanga.kvack.org (Postfix) with ESMTP id 050FA6B0038 for ; Mon, 2 Feb 2015 17:18:30 -0500 (EST) Received: by mail-wg0-f52.google.com with SMTP id y19so41337218wgg.11 for ; Mon, 02 Feb 2015 14:18:29 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id mv7si25631526wib.52.2015.02.02.14.18.27 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Mon, 02 Feb 2015 14:18:28 -0800 (PST) Date: Mon, 2 Feb 2015 22:18:24 +0000 From: Mel Gorman Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150202221824.GN2395@suse.de> References: <20150202165525.GM2395@suse.de> <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , linux-kernel@vger.kernel.org On Mon, Feb 02, 2015 at 02:05:06PM -0800, Andrew Morton wrote: > On Mon, 2 Feb 2015 16:55:25 +0000 Mel Gorman wrote: > > > glibc malloc changed behaviour in glibc 2.10 to have per-thread arenas > > instead of creating new areans if the existing ones were contended. > > The decision appears to have been made so the allocator scales better but the > > downside is that madvise(MADV_DONTNEED) is now called for these per-thread > > areans during free. This tears down pages that would have previously > > remained. There is nothing wrong with this decision from a functional point > > of view but any threaded application that frequently allocates/frees the > > same-sized region is going to incur the full teardown and refault costs. > > MADV_DONTNEED has been there for many years. How could this problem > not have been noticed during glibc 2.10 development/testing? I do not know. I only spotted it due to switching distributions. Looping allocations and frees of the same sizes is considered inefficient and it might have been dismissed on those grounds. It's probably less noticeable when it only affects threaded applications. > Is there > some more recent kernel change which is triggering this? 
> Not that I'm aware of. > > This patch identifies when a thread is frequently calling MADV_DONTNEED > > on the same region of memory and starts ignoring the hint. > > That's pretty nasty-looking :( > Yep, it is but we're very limited in terms of what we can do within the kernel here. > And presumably there are all sorts of behaviours which will still > trigger the problem but which will avoid the start/end equality test in > ignore_madvise_hint()? > Yes. I would expect that a simple pattern of multiple allocs followed by multiple frees in a loop would also trigger it. > Really, this is a glibc problem and only a glibc problem. > MADV_DONTNEED is unavoidably expensive and glibc is calling > MADV_DONTNEED for a region which it *does* need. To be fair to glibc, it calls it on a region it *thinks* it doesn't need only to reuse it immediately afterwards because of how the benchmark is implemented. > Is there something > preventing this from being addressed within glibc? I doubt it other than I expect they'll punt it back and blame either the application for being stupid or the kernel for being slow. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by kanga.kvack.org (Postfix) with ESMTP id A083F6B0038 for ; Mon, 2 Feb 2015 17:22:39 -0500 (EST) Received: by mail-pa0-f47.google.com with SMTP id lj1so87684636pab.6 for ; Mon, 02 Feb 2015 14:22:39 -0800 (PST) Received: from mga14.intel.com (mga14.intel.com. 
[192.55.52.115]) by mx.google.com with ESMTP id km8si111518pbc.254.2015.02.02.14.22.38 for ; Mon, 02 Feb 2015 14:22:38 -0800 (PST) Message-ID: <54CFF8AC.6010102@intel.com> Date: Mon, 02 Feb 2015 14:22:36 -0800 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> In-Reply-To: <20150202165525.GM2395@suse.de> Content-Type: text/plain; charset=iso-8859-15 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , linux-mm@kvack.org Cc: Minchan Kim , Vlastimil Babka , Andrew Morton , linux-kernel@vger.kernel.org On 02/02/2015 08:55 AM, Mel Gorman wrote: > This patch identifies when a thread is frequently calling MADV_DONTNEED > on the same region of memory and starts ignoring the hint. On an 8-core > single-socket machine this was the impact on ebizzy using glibc 2.19. The manpage, at least, claims that we zero-fill after MADV_DONTNEED is called: > MADV_DONTNEED > Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources > associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. So if we have anything depending on the behavior that it's _always_ zero-filled after an MADV_DONTNEED, this will break it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
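The zero-fill behaviour questioned in the message above can be probed from userspace. What follows is only a minimal sketch, assuming Linux and a single page of private anonymous memory; the function name dontneed_zero_fills() is hypothetical and does not come from the thread, the kernel, or the man page. It shows what the kernel under test does today and says nothing about what is guaranteed.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only).
 * Returns 1 if a private anonymous page reads back as zeros after
 * MADV_DONTNEED, 0 if old contents are still visible, -1 on error.
 */
int dontneed_zero_fills(void)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);
	size_t len = (size_t)page;

	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAA, len);	/* dirty the page so it is really populated */

	/* The thread notes the call can fail with EINVAL for some mappings. */
	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	/* Touching the range again faults it back in; check every byte. */
	int zeroed = 1;
	for (size_t i = 0; i < len; i++) {
		if (p[i] != 0) {
			zeroed = 0;
			break;
		}
	}

	munmap(p, len);
	return zeroed;
}
```

A caller would treat a return of 0 as "the hint was ignored or deferred", which is the outcome the message above warns could break code relying on zero-fill.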
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f53.google.com (mail-pa0-f53.google.com [209.85.220.53]) by kanga.kvack.org (Postfix) with ESMTP id 737816B0038 for ; Mon, 2 Feb 2015 17:35:43 -0500 (EST) Received: by mail-pa0-f53.google.com with SMTP id kx10so87789059pab.12 for ; Mon, 02 Feb 2015 14:35:43 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id ku8si151963pab.155.2015.02.02.14.35.42 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 02 Feb 2015 14:35:42 -0800 (PST) Date: Mon, 2 Feb 2015 14:35:41 -0800 From: Andrew Morton Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-Id: <20150202143541.1efdd2b571413200cb9a4698@linux-foundation.org> In-Reply-To: <20150202221824.GN2395@suse.de> References: <20150202165525.GM2395@suse.de> <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org> <20150202221824.GN2395@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , linux-kernel@vger.kernel.org On Mon, 2 Feb 2015 22:18:24 +0000 Mel Gorman wrote: > > Is there something > > preventing this from being addressed within glibc? > > I doubt it other than I expect they'll punt it back and blame either the > application for being stupid or the kernel for being slow. *Is* the application being stupid? What is it actually doing? Something like pthread_routine() { p = malloc(X); do_some(work); free(p); return; } ? If so, that doesn't seem stupid? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f46.google.com (mail-oi0-f46.google.com [209.85.218.46]) by kanga.kvack.org (Postfix) with ESMTP id 320FD900015 for ; Mon, 2 Feb 2015 19:26:37 -0500 (EST) Received: by mail-oi0-f46.google.com with SMTP id a141so46941732oig.5 for ; Mon, 02 Feb 2015 16:26:36 -0800 (PST) Received: from smtp2.provo.novell.com (smtp2.provo.novell.com. [137.65.250.81]) by mx.google.com with ESMTPS id px7si5373611obc.58.2015.02.02.16.26.35 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Mon, 02 Feb 2015 16:26:36 -0800 (PST) Message-ID: <1422923185.14964.2.camel@stgolabs.net> Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints From: Davidlohr Bueso Date: Mon, 02 Feb 2015 16:26:25 -0800 In-Reply-To: <20150202143541.1efdd2b571413200cb9a4698@linux-foundation.org> References: <20150202165525.GM2395@suse.de> <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org> <20150202221824.GN2395@suse.de> <20150202143541.1efdd2b571413200cb9a4698@linux-foundation.org> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Mel Gorman , linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , linux-kernel@vger.kernel.org On Mon, 2015-02-02 at 14:35 -0800, Andrew Morton wrote: > On Mon, 2 Feb 2015 22:18:24 +0000 Mel Gorman wrote: > > > > Is there something > > > preventing this from being addressed within glibc? > > > > I doubt it other than I expect they'll punt it back and blame either the > > application for being stupid or the kernel for being slow. > > *Is* the application being stupid? What is it actually doing? > Something like > > pthread_routine() > { > p = malloc(X); > do_some(work); > free(p); Ebizzy adds a time based loop in there. But yeah, pretty much a standard pthread model. 
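
[Editorial sketch, not part of the original message.] The pattern under discussion, Andrew's pthread_routine() wrapped in the time-based loop Davidlohr mentions, can be sketched roughly as below. This is not ebizzy's code: the names churn_for() and now_seconds() and the memset() stand-in for do_some(work) are assumptions made for illustration. Whether a given free() actually leads the allocator to call madvise(MADV_DONTNEED) depends on the allocation size, the arena in use and the allocator's trim tunables.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Keeps the compiler from optimising the allocation away. */
static volatile unsigned char sink;

/* Hypothetical helper: current time on the monotonic clock, in seconds. */
static double now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Hypothetical worker body: allocate, touch and free a buffer of `size`
 * bytes repeatedly until `seconds` have elapsed. There is no think time
 * between malloc() and free(), so allocator cost dominates the loop.
 * Returns the number of completed iterations.
 */
unsigned long churn_for(double seconds, size_t size)
{
    double deadline = now_seconds() + seconds;
    unsigned long iterations = 0;

    if (size == 0)
        return 0;

    do {
        unsigned char *p = malloc(size);

        if (p == NULL)
            break;
        memset(p, 0xA5, size);  /* stand-in for do_some(work) */
        sink = p[size - 1];
        free(p);
        iterations++;
    } while (now_seconds() < deadline);

    return iterations;
}
```

Running one such loop per thread (for example from a pthread start routine) approximates the workload; an application that reused one buffer across iterations would avoid the repeated allocator round trip altogether.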
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f181.google.com (mail-we0-f181.google.com [74.125.82.181]) by kanga.kvack.org (Postfix) with ESMTP id 861736B0038 for ; Tue, 3 Feb 2015 04:47:25 -0500 (EST) Received: by mail-we0-f181.google.com with SMTP id k48so43923628wev.12 for ; Tue, 03 Feb 2015 01:47:25 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id u1si28017436wiy.37.2015.02.03.01.47.23 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 03 Feb 2015 01:47:23 -0800 (PST) Date: Tue, 3 Feb 2015 09:47:18 +0000 From: Mel Gorman Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150203094718.GO2395@suse.de> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <54CFF8AC.6010102@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , Andrew Morton , linux-kernel@vger.kernel.org On Mon, Feb 02, 2015 at 02:22:36PM -0800, Dave Hansen wrote: > On 02/02/2015 08:55 AM, Mel Gorman wrote: > > This patch identifies when a thread is frequently calling MADV_DONTNEED > > on the same region of memory and starts ignoring the hint. On an 8-core > > single-socket machine this was the impact on ebizzy using glibc 2.19. > > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is > called: > It also claims that the kernel is free to ignore the advice. > > MADV_DONTNEED > > Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources > > associated with it.) 
Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > > underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. > > So if we have anything depending on the behavior that it's _always_ > zero-filled after an MADV_DONTNEED, this will break it. True. I'd be surprised if any application depended on that but to be safe, an ignored hint could clear the pages. It would still be cheaper than a full teardown and refault. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f53.google.com (mail-la0-f53.google.com [209.85.215.53]) by kanga.kvack.org (Postfix) with ESMTP id CECE96B0038 for ; Tue, 3 Feb 2015 05:48:07 -0500 (EST) Received: by mail-la0-f53.google.com with SMTP id gq15so50211354lab.12 for ; Tue, 03 Feb 2015 02:48:07 -0800 (PST) Received: from jenni2.inet.fi (mta-out1.inet.fi. [62.71.2.227]) by mx.google.com with ESMTP id w8si6934201lbb.25.2015.02.03.02.48.05 for ; Tue, 03 Feb 2015 02:48:05 -0800 (PST) Date: Tue, 3 Feb 2015 12:47:56 +0200 From: "Kirill A. 
Shutemov" Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150203104756.GB14259@node.dhcp.inet.fi> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <20150203094718.GO2395@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150203094718.GO2395@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Dave Hansen , linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , Andrew Morton , linux-kernel@vger.kernel.org On Tue, Feb 03, 2015 at 09:47:18AM +0000, Mel Gorman wrote: > On Mon, Feb 02, 2015 at 02:22:36PM -0800, Dave Hansen wrote: > > On 02/02/2015 08:55 AM, Mel Gorman wrote: > > > This patch identifies when a thread is frequently calling MADV_DONTNEED > > > on the same region of memory and starts ignoring the hint. On an 8-core > > > single-socket machine this was the impact on ebizzy using glibc 2.19. > > > > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is > > called: > > > > It also claims that the kernel is free to ignore the advice. > > > > MADV_DONTNEED > > > Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources > > > associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > > > underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. > > > > So if we have anything depending on the behavior that it's _always_ > > zero-filled after an MADV_DONTNEED, this will break it. > > True. I'd be surprised if any application depended on that IIUC, jemalloc depends on this[1]. [1] https://github.com/jemalloc/jemalloc/blob/dev/src/chunk_mmap.c#L117 -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f48.google.com (mail-wg0-f48.google.com [74.125.82.48]) by kanga.kvack.org (Postfix) with ESMTP id 203CA6B006C for ; Tue, 3 Feb 2015 05:50:14 -0500 (EST) Received: by mail-wg0-f48.google.com with SMTP id x12so43765089wgg.7 for ; Tue, 03 Feb 2015 02:50:13 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id wl1si42090146wjb.94.2015.02.03.02.50.12 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 03 Feb 2015 02:50:12 -0800 (PST) Date: Tue, 3 Feb 2015 10:50:07 +0000 From: Mel Gorman Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150203105007.GP2395@suse.de> References: <20150202165525.GM2395@suse.de> <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org> <20150202221824.GN2395@suse.de> <20150202143541.1efdd2b571413200cb9a4698@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150202143541.1efdd2b571413200cb9a4698@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , linux-kernel@vger.kernel.org On Mon, Feb 02, 2015 at 02:35:41PM -0800, Andrew Morton wrote: > On Mon, 2 Feb 2015 22:18:24 +0000 Mel Gorman wrote: > > > > Is there something > > > preventing this from being addressed within glibc? > > > > I doubt it other than I expect they'll punt it back and blame either the > > application for being stupid or the kernel for being slow. > > *Is* the application being stupid? What is it actually doing? Only a little. There is little simulated think time between the allocation and the subsequent free. 
It means the cost of alloc/free dominates, whereas "real" applications would either reuse buffers that are constantly needed, or their think time would mask the cost of the free.

> Something like
>
> pthread_routine()
> {
> 	p = malloc(X);
> 	do_some(work);
> 	free(p);
> 	return;
> }
>

Pretty much. There is a search_mem() function that does

	alloc(copy_size)
	memcpy
	search
	free(copy)

A real application might try to avoid the copy or reuse buffers if it encountered this particular problem.

-- 
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f41.google.com (mail-wg0-f41.google.com [74.125.82.41]) by kanga.kvack.org (Postfix) with ESMTP id B59736B0038 for ; Tue, 3 Feb 2015 06:21:30 -0500 (EST) Received: by mail-wg0-f41.google.com with SMTP id a1so43992543wgh.0 for ; Tue, 03 Feb 2015 03:21:30 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id gy6si6808867wib.19.2015.02.03.03.21.28 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 03 Feb 2015 03:21:29 -0800 (PST) Date: Tue, 3 Feb 2015 11:21:24 +0000 From: Mel Gorman Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150203112124.GS2395@suse.de> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <20150203094718.GO2395@suse.de> <20150203104756.GB14259@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150203104756.GB14259@node.dhcp.inet.fi> Sender: owner-linux-mm@kvack.org List-ID: To: "Kirill A. Shutemov" Cc: Dave Hansen , linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , Andrew Morton , linux-kernel@vger.kernel.org On Tue, Feb 03, 2015 at 12:47:56PM +0200, Kirill A.
Shutemov wrote: > On Tue, Feb 03, 2015 at 09:47:18AM +0000, Mel Gorman wrote: > > On Mon, Feb 02, 2015 at 02:22:36PM -0800, Dave Hansen wrote: > > > On 02/02/2015 08:55 AM, Mel Gorman wrote: > > > > This patch identifies when a thread is frequently calling MADV_DONTNEED > > > > on the same region of memory and starts ignoring the hint. On an 8-core > > > > single-socket machine this was the impact on ebizzy using glibc 2.19. > > > > > > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is > > > called: > > > > > > > It also claims that the kernel is free to ignore the advice. > > > > > > MADV_DONTNEED > > > > Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources > > > > associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > > > > underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. > > > > > > So if we have anything depending on the behavior that it's _always_ > > > zero-filled after an MADV_DONTNEED, this will break it. > > > > True. I'd be surprised if any application depended on that > > IIUC, jemalloc depends on this[1]. > > [1] https://github.com/jemalloc/jemalloc/blob/dev/src/chunk_mmap.c#L117 > Hope they never back regions with hugetlb then or fall apart if the process called mlockall -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f47.google.com (mail-la0-f47.google.com [209.85.215.47]) by kanga.kvack.org (Postfix) with ESMTP id 996546B0038 for ; Tue, 3 Feb 2015 05:53:08 -0500 (EST) Received: by mail-la0-f47.google.com with SMTP id hz20so50183508lab.6 for ; Tue, 03 Feb 2015 02:53:08 -0800 (PST) Received: from jenni2.inet.fi (mta-out1.inet.fi. [62.71.2.195]) by mx.google.com with ESMTP id o15si18895867laa.27.2015.02.03.02.53.06 for ; Tue, 03 Feb 2015 02:53:06 -0800 (PST) Date: Tue, 3 Feb 2015 12:53:01 +0200 From: "Kirill A. Shutemov" Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150203105301.GC14259@node.dhcp.inet.fi> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54D08483.40209@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: Dave Hansen , Mel Gorman , linux-mm@kvack.org, Minchan Kim , Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: > [CC linux-api, man pages] > > On 02/02/2015 11:22 PM, Dave Hansen wrote: > > On 02/02/2015 08:55 AM, Mel Gorman wrote: > >> This patch identifies when a thread is frequently calling MADV_DONTNEED > >> on the same region of memory and starts ignoring the hint. On an 8-core > >> single-socket machine this was the impact on ebizzy using glibc 2.19. > > > > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is > > called: > > > >> MADV_DONTNEED > >> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources > >> associated with it.) 
Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > >> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. > > > > So if we have anything depending on the behavior that it's _always_ > > zero-filled after an MADV_DONTNEED, this will break it. > > OK, so that's a third person (including me) who understood it as a zero-fill > guarantee. I think the man page should be clarified (if it's indeed not > guaranteed), or we have a bug. > > The implementation actually skips MADV_DONTNEED for > VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's. It doesn't skip. It fails with -EINVAL. Or I miss something. > - The word "will result" did sound as a guarantee at least to me. So here it > could be changed to "may result (unless the advice is ignored)"? It's too late to fix documentation. Applications already depends on the beheviour. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f49.google.com (mail-wg0-f49.google.com [74.125.82.49]) by kanga.kvack.org (Postfix) with ESMTP id B14846B0073 for ; Tue, 3 Feb 2015 10:21:25 -0500 (EST) Received: by mail-wg0-f49.google.com with SMTP id k14so45184566wgh.8 for ; Tue, 03 Feb 2015 07:21:25 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id cp9si29207936wib.81.2015.02.03.07.21.23 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 03 Feb 2015 07:21:23 -0800 (PST) Date: Tue, 3 Feb 2015 16:21:21 +0100 From: Michal Hocko Subject: Re: MADV_DONTNEED semantics? 
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150203152121.GC8914@dhcp22.suse.cz> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203111600.GR2395@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150203111600.GR2395@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Vlastimil Babka , Dave Hansen , linux-mm@kvack.org, Minchan Kim , Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org On Tue 03-02-15 11:16:00, Mel Gorman wrote: > On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: [...] > > And if we agree that there is indeed no guarantee, what's the actual semantic > > difference from MADV_FREE? I guess none? So there's only a possible perfomance > > difference? > > > > Timing. MADV_DONTNEED if it has an effect is immediate, is a heavier > operations and RSS is reduced. MADV_FREE only has an impact in the future > if there is memory pressure. JFTR. the man page for MADV_FREE has been proposed already (https://lkml.org/lkml/2014/12/5/63 should be the last version AFAIR). I do not see it in the man-pages git tree but the patch was not in time for 3.19 so I guess it will only appear in 3.20. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f172.google.com (mail-wi0-f172.google.com [209.85.212.172]) by kanga.kvack.org (Postfix) with ESMTP id 84DA06B006E for ; Tue, 3 Feb 2015 11:25:13 -0500 (EST) Received: by mail-wi0-f172.google.com with SMTP id h11so25616785wiw.5 for ; Tue, 03 Feb 2015 08:25:13 -0800 (PST) Received: from mail-wg0-x236.google.com (mail-wg0-x236.google.com. 
[2a00:1450:400c:c00::236]) by mx.google.com with ESMTPS id fy4si32111667wib.47.2015.02.03.08.25.11 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 03 Feb 2015 08:25:11 -0800 (PST) Received: by mail-wg0-f54.google.com with SMTP id b13so45416446wgh.13 for ; Tue, 03 Feb 2015 08:25:10 -0800 (PST) Message-ID: <54D0F664.3070606@gmail.com> Date: Tue, 03 Feb 2015 17:25:08 +0100 From: "Michael Kerrisk (man-pages)" MIME-Version: 1.0 Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203111600.GR2395@suse.de> <20150203152121.GC8914@dhcp22.suse.cz> In-Reply-To: <20150203152121.GC8914@dhcp22.suse.cz> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko , Mel Gorman Cc: mtk.manpages@gmail.com, minchan Kim , Dave Hansen , linux-mm@kvack.orgMinchan Kim , Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, linux-man@vger.kernel.org On 02/03/2015 04:21 PM, Michal Hocko wrote: > On Tue 03-02-15 11:16:00, Mel Gorman wrote: >> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: > [...] >>> And if we agree that there is indeed no guarantee, what's the actual semantic >>> difference from MADV_FREE? I guess none? So there's only a possible perfomance >>> difference? >>> >> >> Timing. MADV_DONTNEED if it has an effect is immediate, is a heavier >> operations and RSS is reduced. MADV_FREE only has an impact in the future >> if there is memory pressure. > > JFTR. the man page for MADV_FREE has been proposed already > (https://lkml.org/lkml/2014/12/5/63 should be the last version AFAIR). I > do not see it in the man-pages git tree but the patch was not in time > for 3.19 so I guess it will only appear in 3.20. > Yikes! That patch was buried in the bottom of a locked filing cabinet in a disused lavatory. 
I unfortunately don't read every thread that comes my way, especially if it doesn't look like a man-pages patch (i.e., falls in the middle of an LKML thread that starts on another topic, and doesn't see linux-man@). I'll respond to that patch soon. (There are some problems that mean I could not accept it, AFAICT.) Thanks, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f45.google.com (mail-pa0-f45.google.com [209.85.220.45]) by kanga.kvack.org (Postfix) with ESMTP id 3E8066B009E for ; Tue, 3 Feb 2015 19:09:30 -0500 (EST) Received: by mail-pa0-f45.google.com with SMTP id et14so102682418pad.4 for ; Tue, 03 Feb 2015 16:09:30 -0800 (PST) Received: from mail-pa0-x22c.google.com (mail-pa0-x22c.google.com. [2607:f8b0:400e:c03::22c]) by mx.google.com with ESMTPS id pk3si4382416pdb.166.2015.02.03.16.09.29 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 03 Feb 2015 16:09:29 -0800 (PST) Received: by mail-pa0-f44.google.com with SMTP id rd3so102523205pab.3 for ; Tue, 03 Feb 2015 16:09:29 -0800 (PST) Date: Wed, 4 Feb 2015 09:09:21 +0900 From: Minchan Kim Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150204000921.GC3583@blaptop> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54D0B43D.8000209@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: "Kirill A. 
Shutemov" , Dave Hansen , Mel Gorman , linux-mm@kvack.org, Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org, Rik van Riel On Tue, Feb 03, 2015 at 12:42:53PM +0100, Vlastimil Babka wrote: > On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote: > > On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: > >> [CC linux-api, man pages] > >> > >> On 02/02/2015 11:22 PM, Dave Hansen wrote: > >> > On 02/02/2015 08:55 AM, Mel Gorman wrote: > >> >> This patch identifies when a thread is frequently calling MADV_DONTNEED > >> >> on the same region of memory and starts ignoring the hint. On an 8-core > >> >> single-socket machine this was the impact on ebizzy using glibc 2.19. > >> > > >> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is > >> > called: > >> > > >> >> MADV_DONTNEED > >> >> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources > >> >> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > >> >> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. > >> > > >> > So if we have anything depending on the behavior that it's _always_ > >> > zero-filled after an MADV_DONTNEED, this will break it. > >> > >> OK, so that's a third person (including me) who understood it as a zero-fill > >> guarantee. I think the man page should be clarified (if it's indeed not > >> guaranteed), or we have a bug. > >> > >> The implementation actually skips MADV_DONTNEED for > >> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's. > > > > It doesn't skip. It fails with -EINVAL. Or I miss something. > > No, I missed that. Thanks for pointing out. 
The manpage also explains EINVAL in > this case: > > * The application is attempting to release locked or shared pages (with > MADV_DONTNEED). > > - that covers mlocking ok, not sure if the rest fits the "shared pages" case > though. I dont see any check for other kinds of shared pages in the code. > > >> - The word "will result" did sound as a guarantee at least to me. So here it > >> could be changed to "may result (unless the advice is ignored)"? > > > > It's too late to fix documentation. Applications already depends on the > > beheviour. > > Right, so as long as they check for EINVAL, it should be safe. It appears that > jemalloc does. > > I still wouldnt be sure just by reading the man page that the clearing is > guaranteed whenever I dont get an error return value, though, >

IMHO, the man page says "MADV_DONTNEED: Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file." Heap memory allocated by malloc(3) consists of anonymous pages, i.e. a mapping without an underlying file, so userspace can expect zero-fill.

The man page says "EINVAL: The application is attempting to release locked or shared pages (with MADV_DONTNEED)". So a user can expect that the call on an area allocated by malloc(3) will always succeed, provided he hasn't called mlock.

The man page says "madvise: This call does not influence the semantics of the application (except in the case of MADV_DONTNEED)". So we shouldn't break MADV_DONTNEED's semantics of freeing pages instantly. It's a long-standing semantic, and it was one of the contentious issues with MADV_FREE when Rik tried, a long time ago, to replace MADV_DONTNEED with MADV_FREE.

-- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f174.google.com (mail-we0-f174.google.com [74.125.82.174]) by kanga.kvack.org (Postfix) with ESMTP id E85096B0073 for ; Wed, 4 Feb 2015 08:46:04 -0500 (EST) Received: by mail-we0-f174.google.com with SMTP id w55so1833385wes.5 for ; Wed, 04 Feb 2015 05:46:04 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id hz2si3064778wjb.173.2015.02.04.05.46.02 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 04 Feb 2015 05:46:03 -0800 (PST) Message-ID: <54D22298.3040504@suse.cz> Date: Wed, 04 Feb 2015 14:46:00 +0100 From: Vlastimil Babka MIME-Version: 1.0 Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> In-Reply-To: <54D0F56A.9050003@gmail.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: "Michael Kerrisk (man-pages)" , "Kirill A. Shutemov" Cc: Dave Hansen , Mel Gorman , linux-mm@kvack.org, Minchan Kim , Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, linux-man@vger.kernel.org, Hugh Dickins On 02/03/2015 05:20 PM, Michael Kerrisk (man-pages) wrote: > Hello Vlastimil > > Thanks for CCing me into this thread. NP > On 02/03/2015 12:42 PM, Vlastimil Babka wrote: >> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote: >>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: >>> >>> It doesn't skip. It fails with -EINVAL. Or I miss something. >> >> No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in >> this case: >> >> * The application is attempting to release locked or shared pages (with >> MADV_DONTNEED). 
> > Yes, there is that. But the page could be more explicit when discussing > MADV_DONTNEED in the main text. I've done that. > >> - that covers mlocking ok, not sure if the rest fits the "shared pages" case >> though. I dont see any check for other kinds of shared pages in the code. > > Agreed. "shared" here seems confused. I've removed it. And I've > added mention of "Huge TLB pages" for this error. > Thanks. >>>> - The word "will result" did sound as a guarantee at least to me. So here it >>>> could be changed to "may result (unless the advice is ignored)"? >>> >>> It's too late to fix documentation. Applications already depends on the >>> beheviour. >> >> Right, so as long as they check for EINVAL, it should be safe. It appears that >> jemalloc does. > > So, first a brief question: in the cases where the call does not error out, > are we agreed that in the current implementation, MADV_DONTNEED will > always result in zero-filled pages when the region is faulted back in > (when we consider pages that are not backed by a file)? I'd agree at this point. Also we should probably mention anonymously shared pages (shmem). I think they behave the same as file here. >> I still wouldnt be sure just by reading the man page that the clearing is >> guaranteed whenever I dont get an error return value, though, > > I'm not quite sure what you want here. I mean: if there's an error, I was just reiterating that the guarantee is not clear from if you consider all the statements in the man page. > then the DONTNEED action didn't occur, right? Therefore, there won't > be zero-filled pages. But, for what it's worth, I added "If the > operation succeeds" at the start of that sentence beginning "Subsequent > accesses...". Yes, that should clarify it. Thanks! 
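
[Editorial sketch, not part of the original message.] The behaviour agreed on above (a successful MADV_DONTNEED on a private anonymous mapping is followed by zero-filled pages, while a failed call leaves the contents alone) can be checked from userspace with something like the following. It assumes Linux and a mapping that is neither mlocked nor hugetlb-backed; the function name dontneed_zero_fills() is made up for this example.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical check: map `len` bytes of private anonymous memory, dirty
 * it, call madvise(MADV_DONTNEED) and read the range back.
 *
 * Returns  1 if every byte reads back as zero after a successful call,
 *          0 if stale data is still visible,
 *         -1 if mmap() or madvise() failed (errno is preserved, e.g.
 *            EINVAL for locked ranges as discussed in the thread).
 */
int dontneed_zero_fills(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int result = 1;
    size_t i;

    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xFF, len);       /* fault the pages in and dirty them */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        int saved = errno;      /* the advice was not applied */

        munmap(p, len);
        errno = saved;
        return -1;
    }

    /* Touching the range again faults in fresh zero-filled pages. */
    for (i = 0; i < len; i++) {
        if (p[i] != 0) {
            result = 0;
            break;
        }
    }

    munmap(p, len);
    return result;
}
```

The important design point is the return-value check: an allocator that relies on zero-fill has to treat a failing madvise() as "contents unchanged" and clear the range itself.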
> Now, some history, explaining why the page is a bit of a mess, > and for that matter why I could really use more help on it from MM > folk (especially in the form of actual patches [1], rather than notes > about deficiencies in the documentation), because: > > ***I simply cannot keep up with all of the details***. I see, and expected it would be like this. I would just send patch if the situation was clear, but here we should agree first, and I thought you should be involved from the beginning. > Once upon a time (Linux 2.4), there was madvise() with just 5 flags: > > MADV_NORMAL > MADV_RANDOM > MADV_SEQUENTIAL > MADV_WILLNEED > MADV_DONTNEED > > And already a dozen years ago, *I* added the text about MADV_DONTNEED. > Back then, I believe it was true. I'm not sure if it's still true now, > but I assume for the moment that it is, and await feedback. And the > text saying that the call does not affect the semantics of memory > access dates back even further (and was then true, MADV_DONTNEED aside). > > Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there > is a semantic mismatch between the destructive MADV_DONTNEED and > POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear > on most other implementations. > > Since the original implementation, numerous pieces of cruft^W^W^W > excellent new flags have been overloaded into this one system call. > Some of those certainly violated the "does not change the semantics > of the application" statement, but, sadly, the kernel developers who > implemented MADV_REMOVE or MADV_DONTFORK did not think to send a > patch to the man page for those new flags, one that might have noted > that the semantics of the application are changed by such flags. Equally > sadly, I did overlook to scan the bigger page when *I* added > documentation of these flags to those pages, otherwise I might have > caught that detail. 
> > So, just to repeat, I could really use more help on it from MM > folk in the form of actual patches to the man page. Thanks for the background. I'll try to remember to check for man-pages part when I review some api changing patch. > Thanks, > > Michael > > [1] https://www.kernel.org/doc/man-pages/patches.html > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f47.google.com (mail-la0-f47.google.com [209.85.215.47]) by kanga.kvack.org (Postfix) with ESMTP id 84C756B0075 for ; Wed, 4 Feb 2015 09:01:22 -0500 (EST) Received: by mail-la0-f47.google.com with SMTP id hz20so1674007lab.6 for ; Wed, 04 Feb 2015 06:01:21 -0800 (PST) Received: from mail-la0-x22b.google.com (mail-la0-x22b.google.com. [2a00:1450:4010:c03::22b]) by mx.google.com with ESMTPS id ai4si1453617lbc.10.2015.02.04.06.01.20 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 04 Feb 2015 06:01:20 -0800 (PST) Received: by mail-la0-f43.google.com with SMTP id pn19so1699161lab.2 for ; Wed, 04 Feb 2015 06:01:20 -0800 (PST) MIME-Version: 1.0 Reply-To: mtk.manpages@gmail.com In-Reply-To: <54D22298.3040504@suse.cz> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> From: "Michael Kerrisk (man-pages)" Date: Wed, 4 Feb 2015 15:00:59 +0100 Message-ID: Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: "Kirill A. 
Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins Hello Vlastimil, On 4 February 2015 at 14:46, Vlastimil Babka wrote: > On 02/03/2015 05:20 PM, Michael Kerrisk (man-pages) wrote: >> >> On 02/03/2015 12:42 PM, Vlastimil Babka wrote: >>> >>> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote: >>>> >>>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: >>>> >>>> It doesn't skip. It fails with -EINVAL. Or I miss something. >>> >>> >>> No, I missed that. Thanks for pointing out. The manpage also explains >>> EINVAL in >>> this case: >>> >>> * The application is attempting to release locked or shared pages (with >>> MADV_DONTNEED). >> >> Yes, there is that. But the page could be more explicit when discussing >> MADV_DONTNEED in the main text. I've done that. >> >>> - that covers mlocking ok, not sure if the rest fits the "shared pages" >>> case >>> though. I dont see any check for other kinds of shared pages in the code. >> >> Agreed. "shared" here seems confused. I've removed it. And I've >> added mention of "Huge TLB pages" for this error. > > Thanks. I also added those cases for MADV_REMOVE, BTW. >>>>> - The word "will result" did sound as a guarantee at least to me. So >>>>> here it >>>>> could be changed to "may result (unless the advice is ignored)"? >>>> >>>> It's too late to fix documentation. Applications already depends on the >>>> beheviour. >>> >>> Right, so as long as they check for EINVAL, it should be safe. It appears >>> that >>> jemalloc does. >> >> >> So, first a brief question: in the cases where the call does not error >> out, >> are we agreed that in the current implementation, MADV_DONTNEED will >> always result in zero-filled pages when the region is faulted back in >> (when we consider pages that are not backed by a file)? > > > I'd agree at this point. Thanks for the confirmation. > Also we should probably mention anonymously shared pages (shmem). 
I think > they behave the same as file here. You mean tmpfs here, right? (I don't keep all of the synonyms straight.) >>> I still wouldnt be sure just by reading the man page that the clearing is >>> guaranteed whenever I dont get an error return value, though, >> >> I'm not quite sure what you want here. I mean: if there's an error, > > I was just reiterating that the guarantee is not clear from if you consider > all the statements in the man page. > >> then the DONTNEED action didn't occur, right? Therefore, there won't >> be zero-filled pages. But, for what it's worth, I added "If the >> operation succeeds" at the start of that sentence beginning "Subsequent >> accesses...". > > Yes, that should clarify it. Thanks! Okay. >> Now, some history, explaining why the page is a bit of a mess, >> and for that matter why I could really use more help on it from MM >> folk (especially in the form of actual patches [1], rather than notes >> about deficiencies in the documentation), because: >> >> ***I simply cannot keep up with all of the details***. > > I see, and expected it would be like this. I would just send patch if the > situation was clear, but here we should agree first, and I thought you > should be involved from the beginning. Sorry -- I should have made it clearer, this statement was not targeted at you personally, or even necessarily at this particular thread. It was a general comment, that came up sharply to me as I looked at how much cruft there is in the madvise() page. >> Once upon a time (Linux 2.4), there was madvise() with just 5 flags: >> >> MADV_NORMAL >> MADV_RANDOM >> MADV_SEQUENTIAL >> MADV_WILLNEED >> MADV_DONTNEED >> >> And already a dozen years ago, *I* added the text about MADV_DONTNEED. >> Back then, I believe it was true. I'm not sure if it's still true now, >> but I assume for the moment that it is, and await feedback. 
And the >> text saying that the call does not affect the semantics of memory >> access dates back even further (and was then true, MADV_DONTNEED aside). >> >> Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there >> is a semantic mismatch between the destructive MADV_DONTNEED and >> POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear >> on most other implementations. >> >> Since the original implementation, numerous pieces of cruft^W^W^W >> excellent new flags have been overloaded into this one system call. >> Some of those certainly violated the "does not change the semantics >> of the application" statement, but, sadly, the kernel developers who >> implemented MADV_REMOVE or MADV_DONTFORK did not think to send a >> patch to the man page for those new flags, one that might have noted >> that the semantics of the application are changed by such flags. Equally >> sadly, I did overlook to scan the bigger page when *I* added >> documentation of these flags to those pages, otherwise I might have >> caught that detail. >> >> So, just to repeat, I could really use more help on it from MM >> folk in the form of actual patches to the man page. > > Thanks for the background. I'll try to remember to check for man-pages part > when I review some api changing patch. That would be great. Thanks, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-lb0-f178.google.com (mail-lb0-f178.google.com [209.85.217.178]) by kanga.kvack.org (Postfix) with ESMTP id CB1F4900015 for ; Wed, 4 Feb 2015 14:24:49 -0500 (EST)
Received: by mail-lb0-f178.google.com with SMTP id u10so3294259lbd.9 for ; Wed, 04 Feb 2015 11:24:49 -0800 (PST)
Received: from mail-lb0-x22d.google.com (mail-lb0-x22d.google.com. [2a00:1450:4010:c04::22d]) by mx.google.com with ESMTPS id w8si2224113lbb.25.2015.02.04.11.24.47 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 04 Feb 2015 11:24:48 -0800 (PST)
Received: by mail-lb0-f173.google.com with SMTP id p9so3325380lbv.4 for ; Wed, 04 Feb 2015 11:24:47 -0800 (PST)
MIME-Version: 1.0
Reply-To: mtk.manpages@gmail.com
In-Reply-To: <54D2508A.9030804@suse.cz>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz>
From: "Michael Kerrisk (man-pages)"
Date: Wed, 4 Feb 2015 20:24:27 +0100
Message-ID:
Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Vlastimil Babka
Cc: "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins

On 4 February 2015 at 18:02, Vlastimil Babka wrote:
> On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote:
>>
>> Hello Vlastimil,
>>
>> On 4 February 2015 at 14:46, Vlastimil Babka wrote:
>>>>>
>>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages"
>>>>> case
>>>>> though. I dont see any check for other kinds of shared pages in the
>>>>> code.
>>>>
>>>>
>>>> Agreed.
"shared" here seems confused. I've removed it. And I've >>>> added mention of "Huge TLB pages" for this error. >>> >>> >>> Thanks. >> >> >> I also added those cases for MADV_REMOVE, BTW. > > > Right. There's also the following for MADV_REMOVE that needs updating: > > "Currently, only shmfs/tmpfs supports this; other filesystems return with > the error ENOSYS." > > - it's not just shmem/tmpfs anymore. It should be best to refer to > fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to > date. > > - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code= is > listed in the ERRORS section. Yup, I recently added that as well, based on a patch from Jan Chaloupka. >>>>>>> - The word "will result" did sound as a guarantee at least to me. S= o >>>>>>> here it >>>>>>> could be changed to "may result (unless the advice is ignored)"? >>>>>> >>>>>> It's too late to fix documentation. Applications already depends on >>>>>> the >>>>>> beheviour. >>>>> >>>>> Right, so as long as they check for EINVAL, it should be safe. It >>>>> appears >>>>> that >>>>> jemalloc does. >>>> >>>> So, first a brief question: in the cases where the call does not error >>>> out, >>>> are we agreed that in the current implementation, MADV_DONTNEED will >>>> always result in zero-filled pages when the region is faulted back in >>>> (when we consider pages that are not backed by a file)? >>> >>> I'd agree at this point. >> >> Thanks for the confirmation. >> >>> Also we should probably mention anonymously shared pages (shmem). I thi= nk >>> they behave the same as file here. >> >> You mean tmpfs here, right? (I don't keep all of the synonyms straight.) > > shmem is tmpfs (that by itself would fit under "files" just fine), but al= so > sys V segments created by shmget(2) and also mappings created by mmap wit= h > MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to > refer to the full list. 
So, how about this text:

    After a successful MADV_DONTNEED operation, the semantics
    of memory access in the specified region are changed:
    subsequent accesses of pages in the range will succeed,
    but will result in either reloading of the memory contents
    from the underlying mapped file (for shared file mappings,
    shared anonymous mappings, and shmem-based techniques such
    as System V shared memory segments) or zero-fill-on-demand
    pages for anonymous private mappings.

Thanks,

Michael

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f45.google.com (mail-pa0-f45.google.com [209.85.220.45]) by kanga.kvack.org (Postfix) with ESMTP id 4D1956B009D for ; Wed, 4 Feb 2015 20:08:09 -0500 (EST)
Received: by mail-pa0-f45.google.com with SMTP id et14so6354107pad.4 for ; Wed, 04 Feb 2015 17:08:09 -0800 (PST)
Received: from mail-pa0-x22d.google.com (mail-pa0-x22d.google.com. [2607:f8b0:400e:c03::22d]) by mx.google.com with ESMTPS id e2si4142221pdc.112.2015.02.04.17.08.08 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 04 Feb 2015 17:08:08 -0800 (PST)
Received: by mail-pa0-f45.google.com with SMTP id et14so6354034pad.4 for ; Wed, 04 Feb 2015 17:08:08 -0800 (PST)
Date: Thu, 5 Feb 2015 10:07:58 +0900
From: Minchan Kim
Subject: Re: MADV_DONTNEED semantics?
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150205010757.GA20996@blaptop> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: "Michael Kerrisk (man-pages)" Cc: Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins Hello, On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote: > On 4 February 2015 at 18:02, Vlastimil Babka wrote: > > On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: > >> > >> Hello Vlastimil, > >> > >> On 4 February 2015 at 14:46, Vlastimil Babka wrote: > >>>>> > >>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" > >>>>> case > >>>>> though. I dont see any check for other kinds of shared pages in the > >>>>> code. > >>>> > >>>> > >>>> Agreed. "shared" here seems confused. I've removed it. And I've > >>>> added mention of "Huge TLB pages" for this error. > >>> > >>> > >>> Thanks. > >> > >> > >> I also added those cases for MADV_REMOVE, BTW. > > > > > > Right. There's also the following for MADV_REMOVE that needs updating: > > > > "Currently, only shmfs/tmpfs supports this; other filesystems return with > > the error ENOSYS." > > > > - it's not just shmem/tmpfs anymore. It should be best to refer to > > fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to > > date. > > > > - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is > > listed in the ERRORS section. > > Yup, I recently added that as well, based on a patch from Jan Chaloupka. 
>
> >>>>>>> - The word "will result" did sound as a guarantee at least to me. So
> >>>>>>> here it
> >>>>>>> could be changed to "may result (unless the advice is ignored)"?
> >>>>>>
> >>>>>> It's too late to fix documentation. Applications already depends on
> >>>>>> the
> >>>>>> beheviour.
> >>>>>
> >>>>> Right, so as long as they check for EINVAL, it should be safe. It
> >>>>> appears
> >>>>> that
> >>>>> jemalloc does.
> >>>>
> >>>> So, first a brief question: in the cases where the call does not error
> >>>> out,
> >>>> are we agreed that in the current implementation, MADV_DONTNEED will
> >>>> always result in zero-filled pages when the region is faulted back in
> >>>> (when we consider pages that are not backed by a file)?
> >>>
> >>> I'd agree at this point.
> >>
> >> Thanks for the confirmation.
> >>
> >>> Also we should probably mention anonymously shared pages (shmem). I think
> >>> they behave the same as file here.
> >>
> >> You mean tmpfs here, right? (I don't keep all of the synonyms straight.)
> >
> > shmem is tmpfs (that by itself would fit under "files" just fine), but also
> > sys V segments created by shmget(2) and also mappings created by mmap with
> > MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to
> > refer to the full list.
>
> So, how about this text:
>
>     After a successful MADV_DONTNEED operation, the semantics
>     of memory access in the specified region are
>     changed: subsequent accesses of pages in the range
>     will succeed, but will result in either reloading of
>     the memory contents from the underlying mapped file
>     (for shared file mappings, shared anonymous mappings,
>     and shmem-based techniques such as System V shared
>     memory segments) or zero-fill-on-demand pages for
>     anonymous private mappings.

Hmm, I'd like to clarify.
Whether it was intention or not, some of userspace developers thought about that syscall drop pages instantly if was no-error return so that they will see more free pages(ie, rss for the process will be decreased) with keeping the VMA. Can we rely on it? And we should make error section, too. "locked" covers mlock(2) and you said you will add hugetlb. Then, VM_PFNMAP? In that case, it fails. How can we say about VM_PFNMAP? special mapping for some drivers? One more thing, "The kernel is free to ignore the advice". It conflicts "This call does not influence the semantics of the application (except in the case of MADV_DONTNEED)" so is it okay we can believe "The kernel is free to ingmore the advise except MADV_DONTNEED"? Thanks. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com [74.125.82.52]) by kanga.kvack.org (Postfix) with ESMTP id 18F64828FD for ; Thu, 5 Feb 2015 10:41:08 -0500 (EST) Received: by mail-wg0-f52.google.com with SMTP id y19so8245956wgg.11 for ; Thu, 05 Feb 2015 07:41:07 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id f4si9855777wje.8.2015.02.05.07.41.05 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 05 Feb 2015 07:41:06 -0800 (PST) Date: Thu, 5 Feb 2015 16:41:02 +0100 From: Michal Hocko Subject: Re: MADV_DONTNEED semantics? 
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Message-ID: <20150205154102.GA20607@dhcp22.suse.cz>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: "Michael Kerrisk (man-pages)"
Cc: Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins

On Wed 04-02-15 20:24:27, Michael Kerrisk wrote:
[...]
> So, how about this text:
>
>     After a successful MADV_DONTNEED operation, the semantics
>     of memory access in the specified region are
>     changed: subsequent accesses of pages in the range
>     will succeed, but will result in either reloading of
>     the memory contents from the underlying mapped file

"
result in either providing the up-to-date contents of the underlying
mapped file
"

Would be more precise IMO because reload might be interpreted as a major
fault which is not necessarily the case (see below).

>     (for shared file mappings, shared anonymous mappings,
>     and shmem-based techniques such as System V shared
>     memory segments) or zero-fill-on-demand pages for
>     anonymous private mappings.

Yes, this wording is better because many users are not aware of
MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't
mention that.

I am just wondering whether it makes sense to mention that MADV_DONTNEED
for shared mappings might be surprising and not freeing the backing
pages thus not really freeing memory until there is a memory
pressure. But maybe this is too implementation specific for a man
page. What about the following wording on top of yours?
" Please note that the MADV_DONTNEED hint on shared mappings might not lead to immediate freeing of pages in the range. The kernel is free to delay this until an appropriate moment. RSS of the calling process will be reduced however. " -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f169.google.com (mail-we0-f169.google.com [74.125.82.169]) by kanga.kvack.org (Postfix) with ESMTP id 06958828FD for ; Thu, 5 Feb 2015 16:44:53 -0500 (EST) Received: by mail-we0-f169.google.com with SMTP id x3so6492503wes.0 for ; Thu, 05 Feb 2015 13:44:52 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id wb6si749283wjc.62.2015.02.05.13.44.50 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 05 Feb 2015 13:44:51 -0800 (PST) Message-ID: <54D3E44B.7060501@redhat.com> Date: Thu, 05 Feb 2015 16:44:43 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org> <20150202221824.GN2395@suse.de> In-Reply-To: <20150202221824.GN2395@suse.de> Content-Type: text/plain; charset=iso-8859-15 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , Andrew Morton Cc: linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , linux-kernel@vger.kernel.org On 02/02/2015 05:18 PM, Mel Gorman wrote: > On Mon, Feb 02, 2015 at 02:05:06PM -0800, Andrew Morton wrote: >> On Mon, 2 Feb 2015 16:55:25 +0000 Mel Gorman wrote: >> >>> glibc malloc changed behaviour in glibc 2.10 to have per-thread arenas >>> instead of creating new areans if the existing ones were contended. 
>>> The decision appears to have been made so the allocator scales better but the >>> downside is that madvise(MADV_DONTNEED) is now called for these per-thread >>> areans during free. This tears down pages that would have previously >>> remained. There is nothing wrong with this decision from a functional point >>> of view but any threaded application that frequently allocates/frees the >>> same-sized region is going to incur the full teardown and refault costs. >> >> MADV_DONTNEED has been there for many years. How could this problem >> not have been noticed during glibc 2.10 development/testing? > > I do not know. I only spotted it due to switching distributions. Looping > allocations and frees of the same sizes is considered inefficient and it > might have been dismissed on those grounds. It's probably less noticeable > when it only affects threaded applications. > >> Is there >> some more recent kernel change which is triggering this? >> > > Not that I'm aware of. > >>> This patch identifies when a thread is frequently calling MADV_DONTNEED >>> on the same region of memory and starts ignoring the hint. >> >> That's pretty nasty-looking :( >> > > Yep, it is but we're very limited in terms of what we can do within the > kernel here. > >> And presumably there are all sorts of behaviours which will still >> trigger the problem but which will avoid the start/end equality test in >> ignore_madvise_hint()? >> > > Yes. I would expect that a simple pattern of multiple allocs followed by > multiple frees in a loop would also trigger it. > >> Really, this is a glibc problem and only a glibc problem. >> MADV_DONTNEED is unavoidably expensive and glibc is calling >> MADV_DONTNEED for a region which it *does* need. > > To be fair to glibc, it calls it on a region it *thinks* it doesn't need only > to reuse it immediately afterwards because of how the benchmark is > implemented. > >> Is there something >> preventing this from being addressed within glibc? 
> > I doubt it other than I expect they'll punt it back and blame either the > application for being stupid or the kernel for being slow. This sounds like something that could benefit from Minchan's MADV_FREE, instead of MADV_DONTNEED. If non page aligned malloc/free does not depend on pages being zeroed, I suspect an MADV_DONTNEED resulting from a malloc/free loop also does not depend on it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f170.google.com (mail-wi0-f170.google.com [209.85.212.170]) by kanga.kvack.org (Postfix) with ESMTP id 6A32D6B006E for ; Fri, 6 Feb 2015 10:41:24 -0500 (EST) Received: by mail-wi0-f170.google.com with SMTP id bs8so3395036wib.1 for ; Fri, 06 Feb 2015 07:41:23 -0800 (PST) Received: from mail-wi0-x232.google.com (mail-wi0-x232.google.com. [2a00:1450:400c:c05::232]) by mx.google.com with ESMTPS id z4si2396961wij.33.2015.02.06.07.41.22 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 06 Feb 2015 07:41:22 -0800 (PST) Received: by mail-wi0-f178.google.com with SMTP id bs8so3326938wib.5 for ; Fri, 06 Feb 2015 07:41:22 -0800 (PST) Message-ID: <54D4E098.8050004@gmail.com> Date: Fri, 06 Feb 2015 16:41:12 +0100 From: "Michael Kerrisk (man-pages)" MIME-Version: 1.0 Subject: Re: MADV_DONTNEED semantics? 
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205010757.GA20996@blaptop> In-Reply-To: <20150205010757.GA20996@blaptop> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: mtk.manpages@gmail.com, Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins On 02/05/2015 02:07 AM, Minchan Kim wrote: > Hello, > > On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote: >> On 4 February 2015 at 18:02, Vlastimil Babka wrote: >>> On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: >>>> >>>> Hello Vlastimil, >>>> >>>> On 4 February 2015 at 14:46, Vlastimil Babka wrote: >>>>>>> >>>>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" >>>>>>> case >>>>>>> though. I dont see any check for other kinds of shared pages in the >>>>>>> code. >>>>>> >>>>>> >>>>>> Agreed. "shared" here seems confused. I've removed it. And I've >>>>>> added mention of "Huge TLB pages" for this error. >>>>> >>>>> >>>>> Thanks. >>>> >>>> >>>> I also added those cases for MADV_REMOVE, BTW. >>> >>> >>> Right. There's also the following for MADV_REMOVE that needs updating: >>> >>> "Currently, only shmfs/tmpfs supports this; other filesystems return with >>> the error ENOSYS." >>> >>> - it's not just shmem/tmpfs anymore. It should be best to refer to >>> fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to >>> date. >>> >>> - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is >>> listed in the ERRORS section. 
>>
>> Yup, I recently added that as well, based on a patch from Jan Chaloupka.
>>
>>>>>>>>> - The word "will result" did sound as a guarantee at least to me. So
>>>>>>>>> here it
>>>>>>>>> could be changed to "may result (unless the advice is ignored)"?
>>>>>>>>
>>>>>>>> It's too late to fix documentation. Applications already depends on
>>>>>>>> the
>>>>>>>> beheviour.
>>>>>>>
>>>>>>> Right, so as long as they check for EINVAL, it should be safe. It
>>>>>>> appears
>>>>>>> that
>>>>>>> jemalloc does.
>>>>>>
>>>>>> So, first a brief question: in the cases where the call does not error
>>>>>> out,
>>>>>> are we agreed that in the current implementation, MADV_DONTNEED will
>>>>>> always result in zero-filled pages when the region is faulted back in
>>>>>> (when we consider pages that are not backed by a file)?
>>>>>
>>>>> I'd agree at this point.
>>>>
>>>> Thanks for the confirmation.
>>>>
>>>>> Also we should probably mention anonymously shared pages (shmem). I think
>>>>> they behave the same as file here.
>>>>
>>>> You mean tmpfs here, right? (I don't keep all of the synonyms straight.)
>>>
>>> shmem is tmpfs (that by itself would fit under "files" just fine), but also
>>> sys V segments created by shmget(2) and also mappings created by mmap with
>>> MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to
>>> refer to the full list.
>>
>> So, how about this text:
>>
>>     After a successful MADV_DONTNEED operation, the semantics
>>     of memory access in the specified region are
>>     changed: subsequent accesses of pages in the range
>>     will succeed, but will result in either reloading of
>>     the memory contents from the underlying mapped file
>>     (for shared file mappings, shared anonymous mappings,
>>     and shmem-based techniques such as System V shared
>>     memory segments) or zero-fill-on-demand pages for
>>     anonymous private mappings.
>
> Hmm, I'd like to clarify.
> > Whether it was intention or not, some of userspace developers thought > about that syscall drop pages instantly if was no-error return so that > they will see more free pages(ie, rss for the process will be decreased) > with keeping the VMA. Can we rely on it? I do not know. Michael? > And we should make error section, too. > "locked" covers mlock(2) and you said you will add hugetlb. Then, > VM_PFNMAP? In that case, it fails. How can we say about VM_PFNMAP? > special mapping for some drivers? I'm open for offers on what to add. > One more thing, "The kernel is free to ignore the advice". > It conflicts "This call does not influence the semantics of the > application (except in the case of MADV_DONTNEED)" so > is it okay we can believe "The kernel is free to ingmore the advise > except MADV_DONTNEED"? I decided to just drop the sentence The kernel is free to ignore the advice. It creates misunderstandings, and does not really add information. Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f169.google.com (mail-wi0-f169.google.com [209.85.212.169]) by kanga.kvack.org (Postfix) with ESMTP id 7A8E76B006C for ; Fri, 6 Feb 2015 10:58:01 -0500 (EST) Received: by mail-wi0-f169.google.com with SMTP id h11so3504139wiw.0 for ; Fri, 06 Feb 2015 07:58:00 -0800 (PST) Received: from mail-we0-x235.google.com (mail-we0-x235.google.com. 
[2a00:1450:400c:c03::235]) by mx.google.com with ESMTPS id kq6si5076674wjc.34.2015.02.06.07.57.59 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 06 Feb 2015 07:58:00 -0800 (PST)
Received: by mail-we0-f181.google.com with SMTP id k48so14522734wev.12 for ; Fri, 06 Feb 2015 07:57:59 -0800 (PST)
Message-ID: <54D4E47E.4020509@gmail.com>
Date: Fri, 06 Feb 2015 16:57:50 +0100
From: "Michael Kerrisk (man-pages)"
MIME-Version: 1.0
Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205154102.GA20607@dhcp22.suse.cz>
In-Reply-To: <20150205154102.GA20607@dhcp22.suse.cz>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Michal Hocko
Cc: mtk.manpages@gmail.com, Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins

Hi Michal

On 02/05/2015 04:41 PM, Michal Hocko wrote:
> On Wed 04-02-15 20:24:27, Michael Kerrisk wrote:
> [...]
>> So, how about this text:
>>
>>     After a successful MADV_DONTNEED operation, the semantics
>>     of memory access in the specified region are
>>     changed: subsequent accesses of pages in the range
>>     will succeed, but will result in either reloading of
>>     the memory contents from the underlying mapped file
>
> "
> result in either providing the up-to-date contents of the underlying
> mapped file
> "

Thanks! I did something like that. See below.

> Would be more precise IMO because reload might be interpreted as a major
> fault which is not necessarily the case (see below).
>
>> (for shared file mappings, shared anonymous mappings,
>> and shmem-based techniques such as System V shared
>> memory segments) or zero-fill-on-demand pages for
>> anonymous private mappings.
>
> Yes, this wording is better because many users are not aware of
> MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't
> mention that.

(Michal, would you have a text to propose to add to the mmap(2) page?
Maybe it would be useful to add something there.)

>
> I am just wondering whether it makes sense to mention that MADV_DONTNEED
> for shared mappings might be surprising and not freeing the backing
> pages thus not really freeing memory until there is a memory
> pressure. But maybe this is too implementation specific for a man
> page. What about the following wording on top of yours?
> "
> Please note that the MADV_DONTNEED hint on shared mappings might not
> lead to immediate freeing of pages in the range. The kernel is free to
> delay this until an appropriate moment. RSS of the calling process will
> be reduced however.
> "

Thanks! I added this, but dropped in the word "immediately" in the last
sentence, since I assume that was implied. So now we have:

    After a successful MADV_DONTNEED operation, the semantics
    of memory access in the specified region are changed:
    subsequent accesses of pages in the range will succeed,
    but will result in either repopulating the memory
    contents from the up-to-date contents of the underlying
    mapped file (for shared file mappings, shared anonymous
    mappings, and shmem-based techniques such as System V
    shared memory segments) or zero-fill-on-demand pages for
    anonymous private mappings.

    Note that, when applied to shared mappings, MADV_DONTNEED
    might not lead to immediate freeing of the pages in the
    range. The kernel is free to delay freeing the pages
    until an appropriate moment. The resident set size (RSS)
    of the calling process will be immediately reduced
    however.
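
(Illustration only, not part of the draft text.) The private anonymous case described in the draft can be checked with a small test along the following lines. This is a minimal sketch assuming Linux with glibc; the helper name dontneed_zero_fills() is made up for this example, and it only exercises a single page of a MAP_PRIVATE|MAP_ANONYMOUS mapping, not the shared or file-backed cases.

```c
#define _DEFAULT_SOURCE 1
#include <assert.h>	/* for the assert() in the usage note */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the man page or the kernel tree.
 * Returns 1 if a private anonymous page reads back as zeros after a
 * successful MADV_DONTNEED, 0 if old contents survived, and -1 if the
 * setup or the madvise() call itself failed.
 */
int dontneed_zero_fills(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int zeroed = 1;
	long i;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Dirty the page so there is something to throw away. */
	memset(p, 0xAA, (size_t)page);

	/* The call can fail, e.g. with EINVAL for locked or hugetlb ranges. */
	if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}

	/* The first access after the call faults the page back in. */
	for (i = 0; i < page; i++) {
		if (p[i] != 0) {
			zeroed = 0;
			break;
		}
	}

	munmap(p, (size_t)page);
	return zeroed;
}
```

A caller could simply do assert(dontneed_zero_fills() == 1). A kernel that silently ignored the hint would make this return 0 rather than fail the madvise() call, which is the kind of breakage discussed earlier in the thread.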
The current draft of the page can be found in a branch, http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise Thanks, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f171.google.com (mail-we0-f171.google.com [74.125.82.171]) by kanga.kvack.org (Postfix) with ESMTP id 02CFE6B0075 for ; Fri, 6 Feb 2015 15:45:42 -0500 (EST) Received: by mail-we0-f171.google.com with SMTP id k11so16004568wes.2 for ; Fri, 06 Feb 2015 12:45:41 -0800 (PST) Received: from mail-wi0-x231.google.com (mail-wi0-x231.google.com. [2a00:1450:400c:c05::231]) by mx.google.com with ESMTPS id l2si6075651wja.190.2015.02.06.12.45.38 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 06 Feb 2015 12:45:39 -0800 (PST) Received: by mail-wi0-f177.google.com with SMTP id r20so5206267wiv.4 for ; Fri, 06 Feb 2015 12:45:38 -0800 (PST) Date: Fri, 6 Feb 2015 21:45:36 +0100 From: Michal Hocko Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150206204536.GA24245@dhcp22.suse.cz> References: <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205154102.GA20607@dhcp22.suse.cz> <54D4E47E.4020509@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <54D4E47E.4020509@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Michael Kerrisk (man-pages)" Cc: Vlastimil Babka , "Kirill A. 
Shutemov", Dave Hansen, Mel Gorman, "linux-mm@kvack.org", Minchan Kim, Andrew Morton, lkml, Linux API, linux-man, Hugh Dickins

On Fri 06-02-15 16:57:50, Michael Kerrisk wrote:
[...]
> > Yes, this wording is better because many users are not aware of
> > MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't
> > mention that.
>
> (Michal, would you have a text to propose to add to the mmap(2) page?
> Maybe it would be useful to add something there.)

I am halfway into my vacation, but I can cook up a patch once I am back
in a week.

> > I am just wondering whether it makes sense to mention that MADV_DONTNEED
> > for shared mappings might be surprising and not freeing the backing
> > pages thus not really freeing memory until there is a memory
> > pressure. But maybe this is too implementation specific for a man
> > page. What about the following wording on top of yours?
> > "
> > Please note that the MADV_DONTNEED hint on shared mappings might not
> > lead to immediate freeing of pages in the range. The kernel is free to
> > delay this until an appropriate moment. RSS of the calling process will
> > be reduced however.
> > "
>
> Thanks! I added this, but dropped in the word "immediately" in the last
> sentence, since I assume that was implied. So now we have:
>
>     After a successful MADV_DONTNEED operation, the semantics
>     of memory access in the specified region are
>     changed: subsequent accesses of pages in the range will
>     succeed, but will result in either repopulating the memory
>     contents from the up-to-date contents of the underlying
>     mapped file (for shared file mappings, shared
>     anonymous mappings, and shmem-based techniques such as
>     System V shared memory segments) or zero-fill-on-demand
>     pages for anonymous private mappings.
>
>     Note that, when applied to shared mappings, MADV_DONTNEED
>     might not lead to immediate freeing of the pages in
>     the range.
The kernel is free to delay freeing the > pages until an appropriate moment. The resident set > size (RSS) of the calling process will be immediately > reduced however. This sounds good to me and it is definitely much better than the current state. Thanks! > The current draft of the page can be found in a branch, > http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise > > Thanks, > > Michael > > > > -- > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/ -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f53.google.com (mail-pa0-f53.google.com [209.85.220.53]) by kanga.kvack.org (Postfix) with ESMTP id E78816B0032 for ; Mon, 9 Feb 2015 01:46:12 -0500 (EST) Received: by mail-pa0-f53.google.com with SMTP id lf10so24198106pab.12 for ; Sun, 08 Feb 2015 22:46:12 -0800 (PST) Received: from mail-pa0-x22e.google.com (mail-pa0-x22e.google.com. [2607:f8b0:400e:c03::22e]) by mx.google.com with ESMTPS id pz3si5476259pbb.32.2015.02.08.22.46.11 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 08 Feb 2015 22:46:11 -0800 (PST) Received: by mail-pa0-f46.google.com with SMTP id lj1so31748189pab.5 for ; Sun, 08 Feb 2015 22:46:11 -0800 (PST) Date: Mon, 9 Feb 2015 15:46:00 +0900 From: Minchan Kim Subject: Re: MADV_DONTNEED semantics? 
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150209064600.GA32300@blaptop> References: <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205010757.GA20996@blaptop> <54D4E098.8050004@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <54D4E098.8050004@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Michael Kerrisk (man-pages)" Cc: Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins Hello, Michael On Fri, Feb 06, 2015 at 04:41:12PM +0100, Michael Kerrisk (man-pages) wrote: > On 02/05/2015 02:07 AM, Minchan Kim wrote: > > Hello, > > > > On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote: > >> On 4 February 2015 at 18:02, Vlastimil Babka wrote: > >>> On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: > >>>> > >>>> Hello Vlastimil, > >>>> > >>>> On 4 February 2015 at 14:46, Vlastimil Babka wrote: > >>>>>>> > >>>>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" > >>>>>>> case > >>>>>>> though. I dont see any check for other kinds of shared pages in the > >>>>>>> code. > >>>>>> > >>>>>> > >>>>>> Agreed. "shared" here seems confused. I've removed it. And I've > >>>>>> added mention of "Huge TLB pages" for this error. > >>>>> > >>>>> > >>>>> Thanks. > >>>> > >>>> > >>>> I also added those cases for MADV_REMOVE, BTW. > >>> > >>> > >>> Right. There's also the following for MADV_REMOVE that needs updating: > >>> > >>> "Currently, only shmfs/tmpfs supports this; other filesystems return with > >>> the error ENOSYS." > >>> > >>> - it's not just shmem/tmpfs anymore. 
It should be best to refer to > >>> fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to > >>> date. > >>> > >>> - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is > >>> listed in the ERRORS section. > >> > >> Yup, I recently added that as well, based on a patch from Jan Chaloupka. > >> > >>>>>>>>> - The word "will result" did sound as a guarantee at least to me. So > >>>>>>>>> here it > >>>>>>>>> could be changed to "may result (unless the advice is ignored)"? > >>>>>>>> > >>>>>>>> It's too late to fix documentation. Applications already depends on > >>>>>>>> the > >>>>>>>> beheviour. > >>>>>>> > >>>>>>> Right, so as long as they check for EINVAL, it should be safe. It > >>>>>>> appears > >>>>>>> that > >>>>>>> jemalloc does. > >>>>>> > >>>>>> So, first a brief question: in the cases where the call does not error > >>>>>> out, > >>>>>> are we agreed that in the current implementation, MADV_DONTNEED will > >>>>>> always result in zero-filled pages when the region is faulted back in > >>>>>> (when we consider pages that are not backed by a file)? > >>>>> > >>>>> I'd agree at this point. > >>>> > >>>> Thanks for the confirmation. > >>>> > >>>>> Also we should probably mention anonymously shared pages (shmem). I think > >>>>> they behave the same as file here. > >>>> > >>>> You mean tmpfs here, right? (I don't keep all of the synonyms straight.) > >>> > >>> shmem is tmpfs (that by itself would fit under "files" just fine), but also > >>> sys V segments created by shmget(2) and also mappings created by mmap with > >>> MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to > >>> refer to the full list. > >> > >> So, how about this text: > >> > >> After a successful MADV_DONTNEED operation, the semana?? 
> >> tics of memory access in the specified region are > >> changed: subsequent accesses of pages in the range > >> will succeed, but will result in either reloading of > >> the memory contents from the underlying mapped file > >> (for shared file mappings, shared anonymous mappings, > >> and shmem-based techniques such as System V shared > >> memory segments) or zero-fill-on-demand pages for > >> anonymous private mappings. > > > > Hmm, I'd like to clarify. > > > > Whether it was intention or not, some of userspace developers thought > > about that syscall drop pages instantly if was no-error return so that > > they will see more free pages(ie, rss for the process will be decreased) > > with keeping the VMA. Can we rely on it? > > I do not know. Michael? It's important to identify difference between MADV_DONTNEED and MADV_FREE so it would be better to clear out in this chance. > > > And we should make error section, too. > > "locked" covers mlock(2) and you said you will add hugetlb. Then, > > VM_PFNMAP? In that case, it fails. How can we say about VM_PFNMAP? > > special mapping for some drivers? > > I'm open for offers on what to add. I suggests from quote "LWN" http://lwn.net/Articles/162860/ "*special mapping* which is not made up of "normal" pages. It is usually created by device drivers which map special memory areas into user space" > > > One more thing, "The kernel is free to ignore the advice". > > It conflicts "This call does not influence the semantics of the > > application (except in the case of MADV_DONTNEED)" so > > is it okay we can believe "The kernel is free to ingmore the advise > > except MADV_DONTNEED"? > > I decided to just drop the sentence > > The kernel is free to ignore the advice. > > It creates misunderstandings, and does not really add information. Sounds good. 
> > Cheers, > > Michael > > -- > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/ -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f54.google.com (mail-pa0-f54.google.com [209.85.220.54]) by kanga.kvack.org (Postfix) with ESMTP id 19E486B0032 for ; Mon, 9 Feb 2015 01:50:57 -0500 (EST) Received: by mail-pa0-f54.google.com with SMTP id kx10so16647515pab.13 for ; Sun, 08 Feb 2015 22:50:56 -0800 (PST) Received: from mail-pd0-f173.google.com (mail-pd0-f173.google.com. [209.85.192.173]) by mx.google.com with ESMTPS id zt10si20895927pbc.18.2015.02.08.22.50.56 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 08 Feb 2015 22:50:56 -0800 (PST) Received: by pdjz10 with SMTP id z10so11693138pdj.9 for ; Sun, 08 Feb 2015 22:50:56 -0800 (PST) Date: Mon, 9 Feb 2015 15:50:45 +0900 From: Minchan Kim Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150209065045.GB32300@blaptop> References: <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205154102.GA20607@dhcp22.suse.cz> <54D4E47E.4020509@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <54D4E47E.4020509@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Michael Kerrisk (man-pages)" Cc: Michal Hocko , Vlastimil Babka , "Kirill A. 
Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins On Fri, Feb 06, 2015 at 04:57:50PM +0100, Michael Kerrisk (man-pages) wrote: > Hi Michael > > On 02/05/2015 04:41 PM, Michal Hocko wrote: > > On Wed 04-02-15 20:24:27, Michael Kerrisk wrote: > > [...] > >> So, how about this text: > >> > >> After a successful MADV_DONTNEED operation, the semana?? > >> tics of memory access in the specified region are > >> changed: subsequent accesses of pages in the range > >> will succeed, but will result in either reloading of > >> the memory contents from the underlying mapped file > > > > " > > result in either providing the up-to-date contents of the underlying > > mapped file > > " > > Thanks! I did something like that. See below. > > > Would be more precise IMO because reload might be interpreted as a major > > fault which is not necessarily the case (see below). > > > >> (for shared file mappings, shared anonymous mappings, > >> and shmem-based techniques such as System V shared > >> memory segments) or zero-fill-on-demand pages for > >> anonymous private mappings. > > > > Yes, this wording is better because many users are not aware of > > MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't > > mention that. > > (Michal, would you have a text to propose to add to the mmap(2) page? > Maybe it would be useful to add something there.) > > > > > I am just wondering whether it makes sense to mention that MADV_DONTNEED > > for shared mappings might be surprising and not freeing the backing > > pages thus not really freeing memory until there is a memory > > pressure. But maybe this is too implementation specific for a man > > page. What about the following wording on top of yours? > > " > > Please note that the MADV_DONTNEED hint on shared mappings might not > > lead to immediate freeing of pages in the range. The kernel is free to > > delay this until an appropriate moment. 
RSS of the calling process will
> > be reduced however.
> > "
>
> Thanks! I added this, but dropped in the word "immediately" in the last
> sentence, since I assume that was implied. So now we have:
>
>     After a successful MADV_DONTNEED operation, the semantics
>     of memory access in the specified region are
>     changed: subsequent accesses of pages in the range will
>     succeed, but will result in either repopulating the memory
>     contents from the up-to-date contents of the underlying
>     mapped file (for shared file mappings, shared
>     anonymous mappings, and shmem-based techniques such as
>     System V shared memory segments) or zero-fill-on-demand
>     pages for anonymous private mappings.
>
>     Note that, when applied to shared mappings, MADV_DONTNEED
>     might not lead to immediate freeing of the pages in
>     the range. The kernel is free to delay freeing the
>     pages until an appropriate moment. The resident set
>     size (RSS) of the calling process will be immediately
>     reduced however.

Looks good.

So I read it as saying that, for anonymous private mappings, MADV_DONTNEED
leads to immediate freeing of the pages in the range, which makes it
clearly different from MADV_FREE.

>
> The current draft of the page can be found in a branch,
> http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise
>
> Thanks,
>
> Michael
>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

--
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932755AbbBBQzd (ORCPT ); Mon, 2 Feb 2015 11:55:33 -0500 Received: from cantor2.suse.de ([195.135.220.15]:35793 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932424AbbBBQza (ORCPT ); Mon, 2 Feb 2015 11:55:30 -0500 Date: Mon, 2 Feb 2015 16:55:25 +0000 From: Mel Gorman To: linux-mm@kvack.org Cc: Minchan Kim , Vlastimil Babka , Andrew Morton , linux-kernel@vger.kernel.org Subject: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150202165525.GM2395@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org glibc malloc changed behaviour in glibc 2.10 to have per-thread arenas instead of creating new areans if the existing ones were contended. The decision appears to have been made so the allocator scales better but the downside is that madvise(MADV_DONTNEED) is now called for these per-thread areans during free. This tears down pages that would have previously remained. There is nothing wrong with this decision from a functional point of view but any threaded application that frequently allocates/frees the same-sized region is going to incur the full teardown and refault costs. This is extremely obvious in the ebizzy benchmark. At its core, threads are frequently freeing and allocating buffers of the same size. It is much faster on distributions with older versions of glibc. Profiles showed that a large amount of system CPU time was spent on tearing down and refaulting pages. This patch identifies when a thread is frequently calling MADV_DONTNEED on the same region of memory and starts ignoring the hint. On an 8-core single-socket machine this was the impact on ebizzy using glibc 2.19. 

ebizzy Overall Throughput
                      3.19.0-rc6              3.19.0-rc6
                         vanilla            madvise-v1r1
Hmean Rsec-1     12619.93 (  0.00%)     34807.02 (175.81%)
Hmean Rsec-3     33434.19 (  0.00%)    100733.77 (201.29%)
Hmean Rsec-5     45796.68 (  0.00%)    134257.34 (193.16%)
Hmean Rsec-7     53146.93 (  0.00%)    145512.85 (173.79%)
Hmean Rsec-12    55132.87 (  0.00%)    145560.86 (164.02%)
Hmean Rsec-18    54846.52 (  0.00%)    145120.79 (164.59%)
Hmean Rsec-24    54368.95 (  0.00%)    142733.89 (162.53%)
Hmean Rsec-30    54388.86 (  0.00%)    141424.09 (160.02%)
Hmean Rsec-32    54047.11 (  0.00%)    139151.76 (157.46%)

And the system CPU usage was also much reduced

           3.19.0-rc6    3.19.0-rc6
              vanilla  madvise-v1r1
User          2647.19       8347.26
System        5742.90         42.42
Elapsed       1350.60       1350.65

It's even more ridiculous on a 4 socket machine

ebizzy Overall Throughput
                      3.19.0-rc6              3.19.0-rc6
                         vanilla            madvise-v1r1
Hmean Rsec-1      5354.37 (  0.00%)     12838.61 ( 139.78%)
Hmean Rsec-4     10338.41 (  0.00%)     50514.52 ( 388.61%)
Hmean Rsec-7      7766.33 (  0.00%)     88555.30 (1040.25%)
Hmean Rsec-12     7188.40 (  0.00%)    154180.78 (2044.86%)
Hmean Rsec-21     7001.82 (  0.00%)    266555.51 (3706.95%)
Hmean Rsec-30     8975.08 (  0.00%)    314369.88 (3402.70%)
Hmean Rsec-48    12136.53 (  0.00%)    358525.74 (2854.10%)
Hmean Rsec-79    12607.37 (  0.00%)    341646.49 (2609.89%)
Hmean Rsec-110   12563.37 (  0.00%)    338058.65 (2590.83%)
Hmean Rsec-141   11701.85 (  0.00%)    331255.78 (2730.80%)
Hmean Rsec-172   10987.39 (  0.00%)    312003.62 (2739.65%)
Hmean Rsec-192   12050.46 (  0.00%)    296401.88 (2359.67%)

           3.19.0-rc6    3.19.0-rc6
              vanilla  madvise-v1r1
User          4136.44      53506.65
System       50262.68        906.49
Elapsed       1802.07       1801.99

Note in both cases that the elapsed time is similar because the benchmark
is configured to run for a fixed duration.

MADV_FREE would have a lower cost if the underlying allocator used it but
there is no guarantee that allocators will use it. Arguably the kernel has
no business preventing an application developer shooting themselves in a
foot but this is a case where it's relatively easy to detect the bad
behaviour and avoid it.

Signed-off-by: Mel Gorman
---
 fs/exec.c             |  4 ++++
 include/linux/sched.h |  5 +++++
 kernel/fork.c         |  5 +++++
 mm/madvise.c          | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 70 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index ad8798e26be9..5c691fcc32f4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1551,6 +1551,10 @@ static int do_execveat_common(int fd, struct filename *filename,
 	current->in_execve = 0;
 	acct_update_integrals(current);
 	task_numa_free(current);
+	if (current->madvise_state) {
+		kfree(current->madvise_state);
+		current->madvise_state = NULL;
+	}
 	free_bprm(bprm);
 	kfree(pathbuf);
 	putname(filename);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef98d2f..b6706bdb27fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1271,6 +1271,9 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+/* mm/madvise.c */
+struct madvise_state_info;
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1637,6 +1640,8 @@ struct task_struct {
 
 	struct page_frag task_frag;
 
+	struct madvise_state_info *madvise_state;
+
 #ifdef CONFIG_TASK_DELAY_ACCT
 	struct task_delay_info *delays;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2ddade9f1..6d8dd1379240 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -246,6 +246,11 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
 
+	if (current->madvise_state) {
+		kfree(current->madvise_state);
+		current->madvise_state = NULL;
+	}
+
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index a271adc93289..907bb0922711 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -251,6 +252,57 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+#define MADVISE_HASH		VMACACHE_HASH
+#define MADVISE_STATE_SIZE	VMACACHE_SIZE
+#define MADVISE_THRESHOLD	8
+
+struct madvise_state_info {
+	unsigned long start;
+	unsigned long end;
+	int count;
+	unsigned long jiffies;
+};
+
+/* Returns true if userspace is continually dropping the same address range */
+static bool ignore_madvise_hint(unsigned long start, unsigned long end)
+{
+	int i;
+
+	if (!current->madvise_state)
+		current->madvise_state = kzalloc(sizeof(struct madvise_state_info) * MADVISE_STATE_SIZE, GFP_KERNEL);
+	if (!current->madvise_state)
+		return false;
+
+	i = VMACACHE_HASH(start);
+	if (current->madvise_state[i].start != start ||
+	    current->madvise_state[i].end != end) {
+		/* cache miss */
+		current->madvise_state[i].start = start;
+		current->madvise_state[i].end = end;
+		current->madvise_state[i].count = 0;
+		current->madvise_state[i].jiffies = jiffies;
+	} else {
+		/* cache hit */
+		unsigned long reset = current->madvise_state[i].jiffies + HZ;
+		if (time_after(jiffies, reset)) {
+			/*
+			 * If it is a second since the last madvise on this
+			 * range or since madvise hints got ignored then reset
+			 * the counts and apply the hint again.
+			 */
+			current->madvise_state[i].count = 0;
+			current->madvise_state[i].jiffies = jiffies;
+		} else
+			current->madvise_state[i].count++;
+
+		if (current->madvise_state[i].count > MADVISE_THRESHOLD)
+			return true;
+		current->madvise_state[i].jiffies = jiffies;
+	}
+
+	return false;
+}
+
 /*
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
@@ -278,6 +330,10 @@ static long madvise_dontneed(struct vm_area_struct *vma,
 	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
 		return -EINVAL;
 
+	/* Ignore hint if madvise is continually dropping the same range */
+	if (ignore_madvise_hint(start, end))
+		return 0;
+
 	if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
 		struct zap_details details = {
 			.nonlinear_vma = vma,

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965291AbbBBWFK (ORCPT ); Mon, 2 Feb 2015 17:05:10 -0500
Received: from mail.linuxfoundation.org ([140.211.169.12]:34905 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965188AbbBBWFH (ORCPT ); Mon, 2 Feb 2015 17:05:07 -0500
Date: Mon, 2 Feb 2015 14:05:06 -0800
From: Andrew Morton
To: Mel Gorman
Cc: linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Message-Id: <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org>
In-Reply-To: <20150202165525.GM2395@suse.de>
References: <20150202165525.GM2395@suse.de>
X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 2 Feb 2015 16:55:25 +0000 Mel Gorman wrote:

> glibc malloc changed behaviour in glibc 2.10 to have per-thread arenas
> instead of creating new areans if the existing ones were contended.
> The decision appears to have been made so the allocator scales better but the
> downside is that madvise(MADV_DONTNEED) is now called for these per-thread
> areans during free. This tears down pages that would have previously
> remained.
There is nothing wrong with this decision from a functional point > of view but any threaded application that frequently allocates/frees the > same-sized region is going to incur the full teardown and refault costs. MADV_DONTNEED has been there for many years. How could this problem not have been noticed during glibc 2.10 development/testing? Is there some more recent kernel change which is triggering this? > This patch identifies when a thread is frequently calling MADV_DONTNEED > on the same region of memory and starts ignoring the hint. That's pretty nasty-looking :( And presumably there are all sorts of behaviours which will still trigger the problem but which will avoid the start/end equality test in ignore_madvise_hint()? Really, this is a glibc problem and only a glibc problem. MADV_DONTNEED is unavoidably expensive and glibc is calling MADV_DONTNEED for a region which it *does* need. Is there something preventing this from being addressed within glibc? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755308AbbBBWSa (ORCPT ); Mon, 2 Feb 2015 17:18:30 -0500 Received: from cantor2.suse.de ([195.135.220.15]:47223 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751544AbbBBWS2 (ORCPT ); Mon, 2 Feb 2015 17:18:28 -0500 Date: Mon, 2 Feb 2015 22:18:24 +0000 From: Mel Gorman To: Andrew Morton Cc: linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150202221824.GN2395@suse.de> References: <20150202165525.GM2395@suse.de> <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 02, 2015 at 02:05:06PM -0800, Andrew Morton wrote: > On Mon, 2 Feb 2015 16:55:25 +0000 Mel Gorman wrote: > > > glibc malloc changed behaviour in glibc 2.10 to have per-thread arenas > > instead of creating new areans if the existing ones were contended. > > The decision appears to have been made so the allocator scales better but the > > downside is that madvise(MADV_DONTNEED) is now called for these per-thread > > areans during free. This tears down pages that would have previously > > remained. There is nothing wrong with this decision from a functional point > > of view but any threaded application that frequently allocates/frees the > > same-sized region is going to incur the full teardown and refault costs. > > MADV_DONTNEED has been there for many years. How could this problem > not have been noticed during glibc 2.10 development/testing? I do not know. I only spotted it due to switching distributions. Looping allocations and frees of the same sizes is considered inefficient and it might have been dismissed on those grounds. It's probably less noticeable when it only affects threaded applications. > Is there > some more recent kernel change which is triggering this? > Not that I'm aware of. > > This patch identifies when a thread is frequently calling MADV_DONTNEED > > on the same region of memory and starts ignoring the hint. > > That's pretty nasty-looking :( > Yep, it is but we're very limited in terms of what we can do within the kernel here. > And presumably there are all sorts of behaviours which will still > trigger the problem but which will avoid the start/end equality test in > ignore_madvise_hint()? > Yes. I would expect that a simple pattern of multiple allocs followed by multiple frees in a loop would also trigger it. > Really, this is a glibc problem and only a glibc problem. > MADV_DONTNEED is unavoidably expensive and glibc is calling > MADV_DONTNEED for a region which it *does* need. 
To be fair to glibc, it calls it on a region it *thinks* it doesn't need only to reuse it immediately afterwards because of how the benchmark is implemented. > Is there something > preventing this from being addressed within glibc? I doubt it other than I expect they'll punt it back and blame either the application for being stupid or the kernel for being slow. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933709AbbBBWWl (ORCPT ); Mon, 2 Feb 2015 17:22:41 -0500 Received: from mga14.intel.com ([192.55.52.115]:56797 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933480AbbBBWWi (ORCPT ); Mon, 2 Feb 2015 17:22:38 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.09,508,1418112000"; d="scan'208";a="660543056" Message-ID: <54CFF8AC.6010102@intel.com> Date: Mon, 02 Feb 2015 14:22:36 -0800 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Mel Gorman , linux-mm@kvack.org CC: Minchan Kim , Vlastimil Babka , Andrew Morton , linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> In-Reply-To: <20150202165525.GM2395@suse.de> Content-Type: text/plain; charset=iso-8859-15 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/02/2015 08:55 AM, Mel Gorman wrote: > This patch identifies when a thread is frequently calling MADV_DONTNEED > on the same region of memory and starts ignoring the hint. On an 8-core > single-socket machine this was the impact on ebizzy using glibc 2.19. The manpage, at least, claims that we zero-fill after MADV_DONTNEED is called: > MADV_DONTNEED > Do not expect access in the near future. 
(For the time being, the application is finished with the given range, so the kernel can free resources > associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. So if we have anything depending on the behavior that it's _always_ zero-filled after an MADV_DONTNEED, this will break it. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965268AbbBBWfn (ORCPT ); Mon, 2 Feb 2015 17:35:43 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:35069 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753728AbbBBWfm (ORCPT ); Mon, 2 Feb 2015 17:35:42 -0500 Date: Mon, 2 Feb 2015 14:35:41 -0800 From: Andrew Morton To: Mel Gorman Cc: linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-Id: <20150202143541.1efdd2b571413200cb9a4698@linux-foundation.org> In-Reply-To: <20150202221824.GN2395@suse.de> References: <20150202165525.GM2395@suse.de> <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org> <20150202221824.GN2395@suse.de> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 2 Feb 2015 22:18:24 +0000 Mel Gorman wrote: > > Is there something > > preventing this from being addressed within glibc? > > I doubt it other than I expect they'll punt it back and blame either the > application for being stupid or the kernel for being slow. *Is* the application being stupid? What is it actually doing? 
Something like

pthread_routine()
{
	p = malloc(X);
	do_some(work);
	free(p);
	return;
}

? If so, that doesn't seem stupid?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S965525AbbBCA0h (ORCPT ); Mon, 2 Feb 2015 19:26:37 -0500
Received: from smtp2.provo.novell.com ([137.65.250.81]:42915 "EHLO smtp2.provo.novell.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S965354AbbBCA0g (ORCPT ); Mon, 2 Feb 2015 19:26:36 -0500
Message-ID: <1422923185.14964.2.camel@stgolabs.net>
Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Davidlohr Bueso
To: Andrew Morton
Cc: Mel Gorman, linux-mm@kvack.org, Minchan Kim, Vlastimil Babka, linux-kernel@vger.kernel.org
Date: Mon, 02 Feb 2015 16:26:25 -0800
In-Reply-To: <20150202143541.1efdd2b571413200cb9a4698@linux-foundation.org>
References: <20150202165525.GM2395@suse.de>
        <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org>
        <20150202221824.GN2395@suse.de>
        <20150202143541.1efdd2b571413200cb9a4698@linux-foundation.org>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.12.7
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 2015-02-02 at 14:35 -0800, Andrew Morton wrote:
> On Mon, 2 Feb 2015 22:18:24 +0000 Mel Gorman wrote:
>
> > > Is there something
> > > preventing this from being addressed within glibc?
> >
> > I doubt it other than I expect they'll punt it back and blame either the
> > application for being stupid or the kernel for being slow.
>
> *Is* the application being stupid? What is it actually doing?
> Something like
>
> pthread_routine()
> {
> 	p = malloc(X);
> 	do_some(work);
> 	free(p);

Ebizzy adds a time based loop in there. But yeah, pretty much a standard
pthread model.
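
The pattern Andrew sketches and Davidlohr confirms (allocate, do some work, free, repeated in a time-based loop) can be written as a compilable C routine. This is only a minimal sketch under stated assumptions, not ebizzy's code: COPY_SIZE, struct routine_arg and now_seconds() are hypothetical names, the memset() stands in for do_some(work), and whether free() really ends in an madvise(MADV_DONTNEED) call depends on the allocator and its thresholds.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical buffer size; not ebizzy's actual value. */
#define COPY_SIZE (256 * 1024)

/* Hypothetical per-thread argument block. */
struct routine_arg {
	double run_seconds;	/* in: how long to keep looping */
	unsigned long cycles;	/* out: completed malloc/free cycles */
};

/* Monotonic clock reading in seconds. */
static double now_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Has the signature pthread_create() expects for a start routine.
 * Returns its argument on success, NULL if malloc() failed.
 */
void *pthread_routine(void *p)
{
	struct routine_arg *arg = p;
	double deadline;

	assert(arg != NULL);
	deadline = now_seconds() + arg->run_seconds;
	arg->cycles = 0;

	do {
		unsigned char *buf = malloc(COPY_SIZE);

		if (buf == NULL)
			return NULL;	/* cycles holds the count so far */

		/* Stand-in for do_some(work): write every byte of the buffer. */
		memset(buf, 0xa5, COPY_SIZE);
		/* Volatile read so the compiler cannot treat the buffer as unused. */
		(void)*(volatile unsigned char *)&buf[COPY_SIZE - 1];

		/* Depending on the allocator, this may hand memory back to the kernel. */
		free(buf);
		arg->cycles++;
	} while (now_seconds() < deadline);

	return arg;
}
```

To approximate the threaded case, each thread would be started with pthread_create(&tid, NULL, pthread_routine, &arg) on its own struct routine_arg, and the resulting cycles counts compared across allocator or kernel configurations.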
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755612AbbBCITV (ORCPT ); Tue, 3 Feb 2015 03:19:21 -0500 Received: from cantor2.suse.de ([195.135.220.15]:35304 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751409AbbBCITT (ORCPT ); Tue, 3 Feb 2015 03:19:19 -0500 Message-ID: <54D08483.40209@suse.cz> Date: Tue, 03 Feb 2015 09:19:15 +0100 From: Vlastimil Babka User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: Dave Hansen , Mel Gorman , linux-mm@kvack.org CC: Minchan Kim , Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org Subject: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> In-Reply-To: <54CFF8AC.6010102@intel.com> Content-Type: text/plain; charset=iso-8859-15 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [CC linux-api, man pages] On 02/02/2015 11:22 PM, Dave Hansen wrote: > On 02/02/2015 08:55 AM, Mel Gorman wrote: >> This patch identifies when a thread is frequently calling MADV_DONTNEED >> on the same region of memory and starts ignoring the hint. On an 8-core >> single-socket machine this was the impact on ebizzy using glibc 2.19. > > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is > called: > >> MADV_DONTNEED >> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources >> associated with it.) 
Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the >> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. > > So if we have anything depending on the behavior that it's _always_ > zero-filled after an MADV_DONTNEED, this will break it. OK, so that's a third person (including me) who understood it as a zero-fill guarantee. I think the man page should be clarified (if it's indeed not guaranteed), or we have a bug. The implementation actually skips MADV_DONTNEED for VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's. I'm not sure about VM_PFNMAP, these are probably special enough. For mlock, one could expect that mlocking and MADV_DONTNEED would be in some opposition, but it's not documented in the manpage AFAIK. Neither is the hugetlb case, which could be really unexpected by the user. Next, what the man page says about guarantees: "The kernel is free to ignore the advice." - that would suggest that nothing is guaranteed "This call does not influence the semantics of the application (except in the case of MADV_DONTNEED)" - that depends if the reader understands it as "does influence by MADV_DONTNEED" or "may influence by MADV_DONTNEED" - btw, isn't MADV_DONTFORK another exception that does influence the semantics? And since it's mentioned as a workaround for some hardware, is it OK to ignore this advice? And the part you already cited: "Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill on-demand pages for mappings without an underlying file." - The word "will result" did sound as a guarantee at least to me. So here it could be changed to "may result (unless the advice is ignored)"? And if we agree that there is indeed no guarantee, what's the actual semantic difference from MADV_FREE? I guess none? 
So there's only a possible performance difference?

Vlastimil

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1755191AbbBCJr3 (ORCPT ); Tue, 3 Feb 2015 04:47:29 -0500
Received: from cantor2.suse.de ([195.135.220.15]:40798 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1751202AbbBCJrY (ORCPT ); Tue, 3 Feb 2015 04:47:24 -0500
Date: Tue, 3 Feb 2015 09:47:18 +0000
From: Mel Gorman
To: Dave Hansen
Cc: linux-mm@kvack.org, Minchan Kim, Vlastimil Babka, Andrew Morton, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Message-ID: <20150203094718.GO2395@suse.de>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <54CFF8AC.6010102@intel.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Feb 02, 2015 at 02:22:36PM -0800, Dave Hansen wrote:
> On 02/02/2015 08:55 AM, Mel Gorman wrote:
> > This patch identifies when a thread is frequently calling MADV_DONTNEED
> > on the same region of memory and starts ignoring the hint. On an 8-core
> > single-socket machine this was the impact on ebizzy using glibc 2.19.
>
> The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
> called:
>

It also claims that the kernel is free to ignore the advice.

> > MADV_DONTNEED
> > Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources
> > associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the
> > underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
> > So if we have anything depending on the behavior that it's _always_ > zero-filled after an MADV_DONTNEED, this will break it. True. I'd be surprised if any application depended on that but to be safe, an ignored hint could clear the pages. It would still be cheaper than a full teardown and refault. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755748AbbBCKsN (ORCPT ); Tue, 3 Feb 2015 05:48:13 -0500 Received: from mta-out1.inet.fi ([62.71.2.203]:34490 "EHLO jenni2.inet.fi" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932169AbbBCKsK (ORCPT ); Tue, 3 Feb 2015 05:48:10 -0500 Date: Tue, 3 Feb 2015 12:47:56 +0200 From: "Kirill A. Shutemov" To: Mel Gorman Cc: Dave Hansen , linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , Andrew Morton , linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150203104756.GB14259@node.dhcp.inet.fi> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <20150203094718.GO2395@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150203094718.GO2395@suse.de> User-Agent: Mutt/1.5.23.1 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 03, 2015 at 09:47:18AM +0000, Mel Gorman wrote: > On Mon, Feb 02, 2015 at 02:22:36PM -0800, Dave Hansen wrote: > > On 02/02/2015 08:55 AM, Mel Gorman wrote: > > > This patch identifies when a thread is frequently calling MADV_DONTNEED > > > on the same region of memory and starts ignoring the hint. On an 8-core > > > single-socket machine this was the impact on ebizzy using glibc 2.19. > > > > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is > > called: > > > > It also claims that the kernel is free to ignore the advice. 
> > > > MADV_DONTNEED > > > Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources > > > associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > > > underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. > > > > So if we have anything depending on the behavior that it's _always_ > > zero-filled after an MADV_DONTNEED, this will break it. > > True. I'd be surprised if any application depended on that IIUC, jemalloc depends on this[1]. [1] https://github.com/jemalloc/jemalloc/blob/dev/src/chunk_mmap.c#L117 -- Kirill A. Shutemov From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932771AbbBCKuQ (ORCPT ); Tue, 3 Feb 2015 05:50:16 -0500 Received: from cantor2.suse.de ([195.135.220.15]:47990 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755729AbbBCKuO (ORCPT ); Tue, 3 Feb 2015 05:50:14 -0500 Date: Tue, 3 Feb 2015 10:50:07 +0000 From: Mel Gorman To: Andrew Morton Cc: linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150203105007.GP2395@suse.de> References: <20150202165525.GM2395@suse.de> <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org> <20150202221824.GN2395@suse.de> <20150202143541.1efdd2b571413200cb9a4698@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150202143541.1efdd2b571413200cb9a4698@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 02, 2015 at 02:35:41PM -0800, Andrew Morton wrote: > On Mon, 2 Feb 
2015 22:18:24 +0000 Mel Gorman wrote: > > > > Is there something > > > preventing this from being addressed within glibc? > > > > I doubt it other than I expect they'll punt it back and blame either the > > application for being stupid or the kernel for being slow. > > *Is* the application being stupid? What is it actually doing? Only a little. There is little simulated think time between the allocation and the subsequent free. It means the cost of alloc/free dominates where in "real" applications they would either be reusing buffers if they were constantly needed or the think time would mask the cost of the free. > Something like > > pthread_routine() > { > p = malloc(X); > do_some(work); > free(p); > return; > } > Pretty much. There is a search_mem() function that alloc(copy_size) memcpy search free(copy) A real application might try and avoid the copy or reuse buffers if they encountered this particular problem. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755618AbbBCKxO (ORCPT ); Tue, 3 Feb 2015 05:53:14 -0500 Received: from mta-out1.inet.fi ([62.71.2.227]:39752 "EHLO jenni2.inet.fi" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752291AbbBCKxL (ORCPT ); Tue, 3 Feb 2015 05:53:11 -0500 Date: Tue, 3 Feb 2015 12:53:01 +0200 From: "Kirill A. Shutemov" To: Vlastimil Babka Cc: Dave Hansen , Mel Gorman , linux-mm@kvack.org, Minchan Kim , Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org Subject: Re: MADV_DONTNEED semantics? 
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150203105301.GC14259@node.dhcp.inet.fi> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54D08483.40209@suse.cz> User-Agent: Mutt/1.5.23.1 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: > [CC linux-api, man pages] > > On 02/02/2015 11:22 PM, Dave Hansen wrote: > > On 02/02/2015 08:55 AM, Mel Gorman wrote: > >> This patch identifies when a thread is frequently calling MADV_DONTNEED > >> on the same region of memory and starts ignoring the hint. On an 8-core > >> single-socket machine this was the impact on ebizzy using glibc 2.19. > > > > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is > > called: > > > >> MADV_DONTNEED > >> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources > >> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > >> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. > > > > So if we have anything depending on the behavior that it's _always_ > > zero-filled after an MADV_DONTNEED, this will break it. > > OK, so that's a third person (including me) who understood it as a zero-fill > guarantee. I think the man page should be clarified (if it's indeed not > guaranteed), or we have a bug. > > The implementation actually skips MADV_DONTNEED for > VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's. It doesn't skip. It fails with -EINVAL. Or I miss something. > - The word "will result" did sound as a guarantee at least to me. 
So here it
> could be changed to "may result (unless the advice is ignored)"?

It's too late to fix documentation. Applications already depend on the
behaviour.

-- 
Kirill A. Shutemov

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1755565AbbBCLQJ (ORCPT ); Tue, 3 Feb 2015 06:16:09 -0500
Received: from cantor2.suse.de ([195.135.220.15]:50907 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1753451AbbBCLQG (ORCPT ); Tue, 3 Feb 2015 06:16:06 -0500
Date: Tue, 3 Feb 2015 11:16:00 +0000
From: Mel Gorman
To: Vlastimil Babka
Cc: Dave Hansen, linux-mm@kvack.org, Minchan Kim, Andrew Morton, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org
Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Message-ID: <20150203111600.GR2395@suse.de>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <54D08483.40209@suse.cz>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote:
> [CC linux-api, man pages]
>
> On 02/02/2015 11:22 PM, Dave Hansen wrote:
> > On 02/02/2015 08:55 AM, Mel Gorman wrote:
> >> This patch identifies when a thread is frequently calling MADV_DONTNEED
> >> on the same region of memory and starts ignoring the hint. On an 8-core
> >> single-socket machine this was the impact on ebizzy using glibc 2.19.
> >
> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
> > called:
> >
> >> MADV_DONTNEED
> >> Do not expect access in the near future.
(For the time being, the application is finished with the given range, so the kernel can free resources > >> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > >> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. > > > > So if we have anything depending on the behavior that it's _always_ > > zero-filled after an MADV_DONTNEED, this will break it. > > OK, so that's a third person (including me) who understood it as a zero-fill > guarantee. I think the man page should be clarified (if it's indeed not > guaranteed), or we have a bug. > > The implementation actually skips MADV_DONTNEED for > VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's. > This was the first reason why I did not consider the zero-filling to be a guarantee. That said, at this point I'm also not considering pushing this patch towards the kernel. I agree that this is a glibc bug so I've dropped a line to some glibc people to see what they think the approach should be. > I'm not sure about VM_PFNMAP, these are probably special enough. For mlock, one > could expect that mlocking and MADV_DONTNEED would be in some opposition, but > it's not documented in the manpage AFAIK. Neither is the hugetlb case, which > could be really unexpected by the user. > The equivalent posix page also lacks details on how exactly this flag should behave. hugetlb is sortof special in that it's always backed by a ram-based file where the contents can be refaulted. It gets hairy when the mapping has been created to look anonymous but is not anonymous really. The semantics of hugetlb have always been fuzzy. > Next, what the man page says about guarantees: > > "The kernel is free to ignore the advice." > > - that would suggest that nothing is guaranteed > Yep, another reason why I did not clear the page when ignoring the hint. 
> "This call does not influence the semantics of the application (except in the > case of MADV_DONTNEED)" > > - that depends if the reader understands it as "does influence by MADV_DONTNEED" > or "may influence by MADV_DONTNEED" > > - btw, isn't MADV_DONTFORK another exception that does influence the semantics? > And since it's mentioned as a workaround for some hardware, is it OK to ignore > this advice? > MADV_DONTFORK is also a Linux-specific extention. It happens to be one that if it gets ignored then the application will be very surprised. > And the part you already cited: > > "Subsequent accesses of pages in this range will succeed, but will result either > in reloading of the memory contents from the underlying mapped file (see > mmap(2)) or zero-fill on-demand pages for mappings without an underlying file." > > - The word "will result" did sound as a guarantee at least to me. So here it > could be changed to "may result (unless the advice is ignored)"? > The wording should be "may result" as there are circumstances where it gets ignored even without this prototype patch. > And if we agree that there is indeed no guarantee, what's the actual semantic > difference from MADV_FREE? I guess none? So there's only a possible perfomance > difference? > Timing. MADV_DONTNEED if it has an effect is immediate, is a heavier operations and RSS is reduced. MADV_FREE only has an impact in the future if there is memory pressure. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755713AbbBCLVc (ORCPT ); Tue, 3 Feb 2015 06:21:32 -0500 Received: from cantor2.suse.de ([195.135.220.15]:51345 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753451AbbBCLV3 (ORCPT ); Tue, 3 Feb 2015 06:21:29 -0500 Date: Tue, 3 Feb 2015 11:21:24 +0000 From: Mel Gorman To: "Kirill A. 
Shutemov" Cc: Dave Hansen , linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , Andrew Morton , linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150203112124.GS2395@suse.de> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <20150203094718.GO2395@suse.de> <20150203104756.GB14259@node.dhcp.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150203104756.GB14259@node.dhcp.inet.fi> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 03, 2015 at 12:47:56PM +0200, Kirill A. Shutemov wrote: > On Tue, Feb 03, 2015 at 09:47:18AM +0000, Mel Gorman wrote: > > On Mon, Feb 02, 2015 at 02:22:36PM -0800, Dave Hansen wrote: > > > On 02/02/2015 08:55 AM, Mel Gorman wrote: > > > > This patch identifies when a thread is frequently calling MADV_DONTNEED > > > > on the same region of memory and starts ignoring the hint. On an 8-core > > > > single-socket machine this was the impact on ebizzy using glibc 2.19. > > > > > > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is > > > called: > > > > > > > It also claims that the kernel is free to ignore the advice. > > > > > > MADV_DONTNEED > > > > Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources > > > > associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > > > > underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. > > > > > > So if we have anything depending on the behavior that it's _always_ > > > zero-filled after an MADV_DONTNEED, this will break it. > > > > True. 
I'd be surprised if any application depended on that > > IIUC, jemalloc depends on this[1]. > > [1] https://github.com/jemalloc/jemalloc/blob/dev/src/chunk_mmap.c#L117 > Hope they never back regions with hugetlb then or fall apart if the process called mlockall -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965517AbbBCLnD (ORCPT ); Tue, 3 Feb 2015 06:43:03 -0500 Received: from cantor2.suse.de ([195.135.220.15]:53423 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965429AbbBCLm4 (ORCPT ); Tue, 3 Feb 2015 06:42:56 -0500 Message-ID: <54D0B43D.8000209@suse.cz> Date: Tue, 03 Feb 2015 12:42:53 +0100 From: Vlastimil Babka User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: "Kirill A. Shutemov" CC: Dave Hansen , Mel Gorman , linux-mm@kvack.org, Minchan Kim , Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> In-Reply-To: <20150203105301.GC14259@node.dhcp.inet.fi> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote: > On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: >> [CC linux-api, man pages] >> >> On 02/02/2015 11:22 PM, Dave Hansen wrote: >> > On 02/02/2015 08:55 AM, Mel Gorman wrote: >> >> This patch identifies when a thread is frequently calling MADV_DONTNEED >> >> on the same region of memory and starts ignoring the hint. 
On an 8-core >> >> single-socket machine this was the impact on ebizzy using glibc 2.19. >> > >> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is >> > called: >> > >> >> MADV_DONTNEED >> >> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources >> >> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the >> >> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. >> > >> > So if we have anything depending on the behavior that it's _always_ >> > zero-filled after an MADV_DONTNEED, this will break it. >> >> OK, so that's a third person (including me) who understood it as a zero-fill >> guarantee. I think the man page should be clarified (if it's indeed not >> guaranteed), or we have a bug. >> >> The implementation actually skips MADV_DONTNEED for >> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's. > > It doesn't skip. It fails with -EINVAL. Or I miss something. No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in this case: * The application is attempting to release locked or shared pages (with MADV_DONTNEED). - that covers mlocking ok, not sure if the rest fits the "shared pages" case though. I dont see any check for other kinds of shared pages in the code. >> - The word "will result" did sound as a guarantee at least to me. So here it >> could be changed to "may result (unless the advice is ignored)"? > > It's too late to fix documentation. Applications already depends on the > beheviour. Right, so as long as they check for EINVAL, it should be safe. It appears that jemalloc does. 
I still wouldnt be sure just by reading the man page that the clearing is guaranteed whenever I dont get an error return value, though, From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933987AbbBCPV2 (ORCPT ); Tue, 3 Feb 2015 10:21:28 -0500 Received: from cantor2.suse.de ([195.135.220.15]:40509 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754176AbbBCPVY (ORCPT ); Tue, 3 Feb 2015 10:21:24 -0500 Date: Tue, 3 Feb 2015 16:21:21 +0100 From: Michal Hocko To: Mel Gorman Cc: Vlastimil Babka , Dave Hansen , linux-mm@kvack.org, Minchan Kim , Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150203152121.GC8914@dhcp22.suse.cz> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203111600.GR2395@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150203111600.GR2395@suse.de> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 03-02-15 11:16:00, Mel Gorman wrote: > On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: [...] > > And if we agree that there is indeed no guarantee, what's the actual semantic > > difference from MADV_FREE? I guess none? So there's only a possible perfomance > > difference? > > > > Timing. MADV_DONTNEED if it has an effect is immediate, is a heavier > operations and RSS is reduced. MADV_FREE only has an impact in the future > if there is memory pressure. JFTR. the man page for MADV_FREE has been proposed already (https://lkml.org/lkml/2014/12/5/63 should be the last version AFAIR). 
I do not see it in the man-pages git tree but the patch was not in time for 3.19 so I guess it will only appear in 3.20. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966444AbbBCQVJ (ORCPT ); Tue, 3 Feb 2015 11:21:09 -0500 Received: from mail-wg0-f48.google.com ([74.125.82.48]:47382 "EHLO mail-wg0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932332AbbBCQVE (ORCPT ); Tue, 3 Feb 2015 11:21:04 -0500 Message-ID: <54D0F56A.9050003@gmail.com> Date: Tue, 03 Feb 2015 17:20:58 +0100 From: "Michael Kerrisk (man-pages)" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Vlastimil Babka , "Kirill A. Shutemov" CC: mtk.manpages@gmail.com, Dave Hansen , Mel Gorman , linux-mm@kvack.org, Minchan Kim , Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, linux-man@vger.kernel.org, Hugh Dickins Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> In-Reply-To: <54D0B43D.8000209@suse.cz> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Vlastimil Thanks for CCing me into this thread. On 02/03/2015 12:42 PM, Vlastimil Babka wrote: > On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote: >> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: >>> [CC linux-api, man pages] >>> >>> On 02/02/2015 11:22 PM, Dave Hansen wrote: >>>> On 02/02/2015 08:55 AM, Mel Gorman wrote: >>>>> This patch identifies when a thread is frequently calling MADV_DONTNEED >>>>> on the same region of memory and starts ignoring the hint. 
On an 8-core >>>>> single-socket machine this was the impact on ebizzy using glibc 2.19. >>>> >>>> The manpage, at least, claims that we zero-fill after MADV_DONTNEED is >>>> called: >>>> >>>>> MADV_DONTNEED >>>>> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources >>>>> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the >>>>> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. >>>> >>>> So if we have anything depending on the behavior that it's _always_ >>>> zero-filled after an MADV_DONTNEED, this will break it. >>> >>> OK, so that's a third person (including me) who understood it as a zero-fill >>> guarantee. I think the man page should be clarified (if it's indeed not >>> guaranteed), or we have a bug. >>> >>> The implementation actually skips MADV_DONTNEED for >>> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's. >> >> It doesn't skip. It fails with -EINVAL. Or I miss something. > > No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in > this case: > > * The application is attempting to release locked or shared pages (with > MADV_DONTNEED). Yes, there is that. But the page could be more explicit when discussing MADV_DONTNEED in the main text. I've done that. > - that covers mlocking ok, not sure if the rest fits the "shared pages" case > though. I dont see any check for other kinds of shared pages in the code. Agreed. "shared" here seems confused. I've removed it. And I've added mention of "Huge TLB pages" for this error. >>> - The word "will result" did sound as a guarantee at least to me. So here it >>> could be changed to "may result (unless the advice is ignored)"? >> >> It's too late to fix documentation. Applications already depends on the >> beheviour. 
> > Right, so as long as they check for EINVAL, it should be safe. It appears that > jemalloc does. So, first a brief question: in the cases where the call does not error out, are we agreed that in the current implementation, MADV_DONTNEED will always result in zero-filled pages when the region is faulted back in (when we consider pages that are not backed by a file)? > I still wouldnt be sure just by reading the man page that the clearing is > guaranteed whenever I dont get an error return value, though, I'm not quite sure what you want here. I mean: if there's an error, then the DONTNEED action didn't occur, right? Therefore, there won't be zero-filled pages. But, for what it's worth, I added "If the operation succeeds" at the start of that sentence beginning "Subsequent accesses...". Now, some history, explaining why the page is a bit of a mess, and for that matter why I could really use more help on it from MM folk (especially in the form of actual patches [1], rather than notes about deficiencies in the documentation), because: ***I simply cannot keep up with all of the details***. Once upon a time (Linux 2.4), there was madvise() with just 5 flags: MADV_NORMAL MADV_RANDOM MADV_SEQUENTIAL MADV_WILLNEED MADV_DONTNEED And already a dozen years ago, *I* added the text about MADV_DONTNEED. Back then, I believe it was true. I'm not sure if it's still true now, but I assume for the moment that it is, and await feedback. And the text saying that the call does not affect the semantics of memory access dates back even further (and was then true, MADV_DONTNEED aside). Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there is a semantic mismatch between the destructive MADV_DONTNEED and POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear on most other implementations. Since the original implementation, numerous pieces of cruft^W^W^W excellent new flags have been overloaded into this one system call. 
Some of those certainly violated the "does not change the semantics of the application" statement, but, sadly, the kernel developers who implemented MADV_REMOVE or MADV_DONTFORK did not think to send a patch to the man page for those new flags, one that might have noted that the semantics of the application are changed by such flags. Equally sadly, I did overlook to scan the bigger page when *I* added documentation of these flags to those pages, otherwise I might have caught that detail. So, just to repeat, I could really use more help on it from MM folk in the form of actual patches to the man page. Thanks, Michael [1] https://www.kernel.org/doc/man-pages/patches.html -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966482AbbBCQZP (ORCPT ); Tue, 3 Feb 2015 11:25:15 -0500 Received: from mail-we0-f178.google.com ([74.125.82.178]:58298 "EHLO mail-we0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965973AbbBCQZM (ORCPT ); Tue, 3 Feb 2015 11:25:12 -0500 Message-ID: <54D0F664.3070606@gmail.com> Date: Tue, 03 Feb 2015 17:25:08 +0100 From: "Michael Kerrisk (man-pages)" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Michal Hocko , Mel Gorman CC: mtk.manpages@gmail.com, minchan Kim , Dave Hansen , linux-mm@kvack.org, Minchan Kim , Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, linux-man@vger.kernel.org Subject: Re: MADV_DONTNEED semantics? 
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203111600.GR2395@suse.de> <20150203152121.GC8914@dhcp22.suse.cz> In-Reply-To: <20150203152121.GC8914@dhcp22.suse.cz> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/03/2015 04:21 PM, Michal Hocko wrote: > On Tue 03-02-15 11:16:00, Mel Gorman wrote: >> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: > [...] >>> And if we agree that there is indeed no guarantee, what's the actual semantic >>> difference from MADV_FREE? I guess none? So there's only a possible perfomance >>> difference? >>> >> >> Timing. MADV_DONTNEED if it has an effect is immediate, is a heavier >> operations and RSS is reduced. MADV_FREE only has an impact in the future >> if there is memory pressure. > > JFTR. the man page for MADV_FREE has been proposed already > (https://lkml.org/lkml/2014/12/5/63 should be the last version AFAIR). I > do not see it in the man-pages git tree but the patch was not in time > for 3.19 so I guess it will only appear in 3.20. > Yikes! That patch was buried in the bottom of a locked filing cabinet in a disused lavatory. I unfortunately don't read every thread that comes my way, especially if it doesn't look like a man-pages patch (i.e., falls in the middle of an LKML thread that starts on another topic, and doesn't see linux-man@). I'll respond to that patch soon. (There are some problems that mean I could not accept it, AFAICT.) 
Thanks, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932681AbbBDAJd (ORCPT ); Tue, 3 Feb 2015 19:09:33 -0500 Received: from mail-pa0-f46.google.com ([209.85.220.46]:57856 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753078AbbBDAJ3 (ORCPT ); Tue, 3 Feb 2015 19:09:29 -0500 Date: Wed, 4 Feb 2015 09:09:21 +0900 From: Minchan Kim To: Vlastimil Babka Cc: "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , linux-mm@kvack.org, Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, mtk.manpages@gmail.com, linux-man@vger.kernel.org, Rik van Riel Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150204000921.GC3583@blaptop> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <54D0B43D.8000209@suse.cz> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 03, 2015 at 12:42:53PM +0100, Vlastimil Babka wrote: > On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote: > > On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: > >> [CC linux-api, man pages] > >> > >> On 02/02/2015 11:22 PM, Dave Hansen wrote: > >> > On 02/02/2015 08:55 AM, Mel Gorman wrote: > >> >> This patch identifies when a thread is frequently calling MADV_DONTNEED > >> >> on the same region of memory and starts ignoring the hint. On an 8-core > >> >> single-socket machine this was the impact on ebizzy using glibc 2.19. 
> >> > > >> > The manpage, at least, claims that we zero-fill after MADV_DONTNEED is > >> > called: > >> > > >> >> MADV_DONTNEED > >> >> Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources > >> >> associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the > >> >> underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file. > >> > > >> > So if we have anything depending on the behavior that it's _always_ > >> > zero-filled after an MADV_DONTNEED, this will break it. > >> > >> OK, so that's a third person (including me) who understood it as a zero-fill > >> guarantee. I think the man page should be clarified (if it's indeed not > >> guaranteed), or we have a bug. > >> > >> The implementation actually skips MADV_DONTNEED for > >> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's. > > > > It doesn't skip. It fails with -EINVAL. Or I miss something. > > No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in > this case: > > * The application is attempting to release locked or shared pages (with > MADV_DONTNEED). > > - that covers mlocking ok, not sure if the rest fits the "shared pages" case > though. I dont see any check for other kinds of shared pages in the code. > > >> - The word "will result" did sound as a guarantee at least to me. So here it > >> could be changed to "may result (unless the advice is ignored)"? > > > > It's too late to fix documentation. Applications already depends on the > > beheviour. > > Right, so as long as they check for EINVAL, it should be safe. It appears that > jemalloc does. 
> > I still wouldnt be sure just by reading the man page that the clearing is > guaranteed whenever I dont get an error return value, though, > IMHO, the man page says "MADV_DONTNEED: Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file." Heap memory allocated by malloc(3) is anonymous memory, so it is a mapping without an underlying file, and userspace can expect zero-fill. The man page says "EINVAL: The application is attempting to release locked or shared pages (with MADV_DONTNEED)" So a user can expect that the call on an area allocated by malloc(3) will always succeed, as long as he doesn't call mlock. The man page says "madvise: This call does not influence the semantics of the application (except in the case of MADV_DONTNEED)" So we shouldn't break MADV_DONTNEED's semantics of freeing pages instantly. That has been the semantic for a long time, and it was one of the contentious issues with MADV_FREE when Rik tried, a long time ago, to replace MADV_DONTNEED with MADV_FREE. -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934036AbbBDNqI (ORCPT ); Wed, 4 Feb 2015 08:46:08 -0500 Received: from cantor2.suse.de ([195.135.220.15]:44516 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754262AbbBDNqD (ORCPT ); Wed, 4 Feb 2015 08:46:03 -0500 Message-ID: <54D22298.3040504@suse.cz> Date: Wed, 04 Feb 2015 14:46:00 +0100 From: Vlastimil Babka User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: "Michael Kerrisk (man-pages)" , "Kirill A. Shutemov" CC: Dave Hansen , Mel Gorman , linux-mm@kvack.org, Minchan Kim , Andrew Morton , linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, linux-man@vger.kernel.org, Hugh Dickins Subject: Re: MADV_DONTNEED semantics?
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> In-Reply-To: <54D0F56A.9050003@gmail.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/03/2015 05:20 PM, Michael Kerrisk (man-pages) wrote: > Hello Vlastimil > > Thanks for CCing me into this thread. NP > On 02/03/2015 12:42 PM, Vlastimil Babka wrote: >> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote: >>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: >>> >>> It doesn't skip. It fails with -EINVAL. Or I miss something. >> >> No, I missed that. Thanks for pointing out. The manpage also explains EINVAL in >> this case: >> >> * The application is attempting to release locked or shared pages (with >> MADV_DONTNEED). > > Yes, there is that. But the page could be more explicit when discussing > MADV_DONTNEED in the main text. I've done that. > >> - that covers mlocking ok, not sure if the rest fits the "shared pages" case >> though. I dont see any check for other kinds of shared pages in the code. > > Agreed. "shared" here seems confused. I've removed it. And I've > added mention of "Huge TLB pages" for this error. > Thanks. >>>> - The word "will result" did sound as a guarantee at least to me. So here it >>>> could be changed to "may result (unless the advice is ignored)"? >>> >>> It's too late to fix documentation. Applications already depends on the >>> beheviour. >> >> Right, so as long as they check for EINVAL, it should be safe. It appears that >> jemalloc does. 
> > So, first a brief question: in the cases where the call does not error out, > are we agreed that in the current implementation, MADV_DONTNEED will > always result in zero-filled pages when the region is faulted back in > (when we consider pages that are not backed by a file)? I'd agree at this point. Also we should probably mention anonymously shared pages (shmem). I think they behave the same as file mappings here. >> I still wouldnt be sure just by reading the man page that the clearing is >> guaranteed whenever I dont get an error return value, though, > > I'm not quite sure what you want here. I mean: if there's an error, I was just reiterating that the guarantee is not clear if you consider all the statements in the man page. > then the DONTNEED action didn't occur, right? Therefore, there won't > be zero-filled pages. But, for what it's worth, I added "If the > operation succeeds" at the start of that sentence beginning "Subsequent > accesses...". Yes, that should clarify it. Thanks! > Now, some history, explaining why the page is a bit of a mess, > and for that matter why I could really use more help on it from MM > folk (especially in the form of actual patches [1], rather than notes > about deficiencies in the documentation), because: > > ***I simply cannot keep up with all of the details***. I see, and I expected it would be like this. I would just send a patch if the situation was clear, but here we should agree first, and I thought you should be involved from the beginning. > Once upon a time (Linux 2.4), there was madvise() with just 5 flags: > > MADV_NORMAL > MADV_RANDOM > MADV_SEQUENTIAL > MADV_WILLNEED > MADV_DONTNEED > > And already a dozen years ago, *I* added the text about MADV_DONTNEED. > Back then, I believe it was true. I'm not sure if it's still true now, > but I assume for the moment that it is, and await feedback.
And the > text saying that the call does not affect the semantics of memory > access dates back even further (and was then true, MADV_DONTNEED aside). > > Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there > is a semantic mismatch between the destructive MADV_DONTNEED and > POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear > on most other implementations. > > Since the original implementation, numerous pieces of cruft^W^W^W > excellent new flags have been overloaded into this one system call. > Some of those certainly violated the "does not change the semantics > of the application" statement, but, sadly, the kernel developers who > implemented MADV_REMOVE or MADV_DONTFORK did not think to send a > patch to the man page for those new flags, one that might have noted > that the semantics of the application are changed by such flags. Equally > sadly, I did overlook to scan the bigger page when *I* added > documentation of these flags to those pages, otherwise I might have > caught that detail. > > So, just to repeat, I could really use more help on it from MM > folk in the form of actual patches to the man page. Thanks for the background. I'll try to remember to check for man-pages part when I review some api changing patch. 
> Thanks, > > Michael > > [1] https://www.kernel.org/doc/man-pages/patches.html > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933993AbbBDOBY (ORCPT ); Wed, 4 Feb 2015 09:01:24 -0500 Received: from mail-la0-f49.google.com ([209.85.215.49]:58591 "EHLO mail-la0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755024AbbBDOBW (ORCPT ); Wed, 4 Feb 2015 09:01:22 -0500 MIME-Version: 1.0 Reply-To: mtk.manpages@gmail.com In-Reply-To: <54D22298.3040504@suse.cz> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> From: "Michael Kerrisk (man-pages)" Date: Wed, 4 Feb 2015 15:00:59 +0100 Message-ID: Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints To: Vlastimil Babka Cc: "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Vlastimil, On 4 February 2015 at 14:46, Vlastimil Babka wrote: > On 02/03/2015 05:20 PM, Michael Kerrisk (man-pages) wrote: >> >> On 02/03/2015 12:42 PM, Vlastimil Babka wrote: >>> >>> On 02/03/2015 11:53 AM, Kirill A. Shutemov wrote: >>>> >>>> On Tue, Feb 03, 2015 at 09:19:15AM +0100, Vlastimil Babka wrote: >>>> >>>> It doesn't skip. It fails with -EINVAL. Or I miss something. >>> >>> >>> No, I missed that. Thanks for pointing out. The manpage also explains >>> EINVAL in >>> this case: >>> >>> * The application is attempting to release locked or shared pages (with >>> MADV_DONTNEED). >> >> Yes, there is that. But the page could be more explicit when discussing >> MADV_DONTNEED in the main text. 
I've done that. >> >>> - that covers mlocking ok, not sure if the rest fits the "shared pages" >>> case >>> though. I dont see any check for other kinds of shared pages in the code. >> >> Agreed. "shared" here seems confused. I've removed it. And I've >> added mention of "Huge TLB pages" for this error. > > Thanks. I also added those cases for MADV_REMOVE, BTW. >>>>> - The word "will result" did sound as a guarantee at least to me. So >>>>> here it >>>>> could be changed to "may result (unless the advice is ignored)"? >>>> >>>> It's too late to fix documentation. Applications already depends on the >>>> beheviour. >>> >>> Right, so as long as they check for EINVAL, it should be safe. It appears >>> that >>> jemalloc does. >> >> >> So, first a brief question: in the cases where the call does not error >> out, >> are we agreed that in the current implementation, MADV_DONTNEED will >> always result in zero-filled pages when the region is faulted back in >> (when we consider pages that are not backed by a file)? > > > I'd agree at this point. Thanks for the confirmation. > Also we should probably mention anonymously shared pages (shmem). I think > they behave the same as file here. You mean tmpfs here, right? (I don't keep all of the synonyms straight.) >>> I still wouldnt be sure just by reading the man page that the clearing is >>> guaranteed whenever I dont get an error return value, though, >> >> I'm not quite sure what you want here. I mean: if there's an error, > > I was just reiterating that the guarantee is not clear from if you consider > all the statements in the man page. > >> then the DONTNEED action didn't occur, right? Therefore, there won't >> be zero-filled pages. But, for what it's worth, I added "If the >> operation succeeds" at the start of that sentence beginning "Subsequent >> accesses...". > > Yes, that should clarify it. Thanks! Okay. 
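The point agreed on above (a successful MADV_DONTNEED on memory not backed by a file reads back as zeros, and a failed call should not be assumed to have dropped anything) can be sketched in C. This is only an illustration, assuming Linux and a private anonymous mapping; the helper names `map_anon`, `dontneed_zero_fills` and `dontneed_errno_on_locked_page` are hypothetical and come neither from the man page nor from jemalloc.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of private anonymous memory; returns NULL on failure. */
static unsigned char *map_anon(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: returns 1 if the mapping reads back as all zeros
 * after a successful MADV_DONTNEED, 0 if any old byte survived, and -1 if
 * mmap() or madvise() failed.
 */
int dontneed_zero_fills(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0)
        return -1;
    size_t len = npages * (size_t)page;

    unsigned char *p = map_anon(len);
    if (p == NULL)
        return -1;

    memset(p, 0xAA, len);                 /* dirty every page */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);                   /* on error, assume nothing was dropped */
        return -1;
    }

    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {    /* re-faulting should yield zero pages */
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }
    munmap(p, len);
    return all_zero;
}

/*
 * Hypothetical helper: returns the errno produced by MADV_DONTNEED on an
 * mlock()ed page, 0 if the call unexpectedly succeeded, or -1 if the setup
 * (mmap/mlock) itself failed. The thread expects EINVAL in this case; this
 * is an assumption about the kernel in use, not a guarantee.
 */
int dontneed_errno_on_locked_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = map_anon(len);
    if (p == NULL)
        return -1;
    if (mlock(p, len) != 0) {
        munmap(p, len);
        return -1;
    }

    int err = 0;
    if (madvise(p, len, MADV_DONTNEED) != 0)
        err = errno;

    munlock(p, len);
    munmap(p, len);
    return err;
}
```

A caller that relies on the zero-fill, as allocators reportedly do, would treat any nonzero return from madvise() as "contents unchanged" and clear the memory itself.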
>> Now, some history, explaining why the page is a bit of a mess, >> and for that matter why I could really use more help on it from MM >> folk (especially in the form of actual patches [1], rather than notes >> about deficiencies in the documentation), because: >> >> ***I simply cannot keep up with all of the details***. > > I see, and expected it would be like this. I would just send patch if the > situation was clear, but here we should agree first, and I thought you > should be involved from the beginning. Sorry -- I should have made it clearer, this statement was not targeted at you personally, or even necessarily at this particular thread. It was a general comment, that came up sharply to me as I looked at how much cruft there is in the madvise() page. >> Once upon a time (Linux 2.4), there was madvise() with just 5 flags: >> >> MADV_NORMAL >> MADV_RANDOM >> MADV_SEQUENTIAL >> MADV_WILLNEED >> MADV_DONTNEED >> >> And already a dozen years ago, *I* added the text about MADV_DONTNEED. >> Back then, I believe it was true. I'm not sure if it's still true now, >> but I assume for the moment that it is, and await feedback. And the >> text saying that the call does not affect the semantics of memory >> access dates back even further (and was then true, MADV_DONTNEED aside). >> >> Those 5 flags have analogs in POSIX's posix_madvise() (albeit, there >> is a semantic mismatch between the destructive MADV_DONTNEED and >> POSIX's nondestructive POSIX_MADV_DONTNEED). They also appear >> on most other implementations. >> >> Since the original implementation, numerous pieces of cruft^W^W^W >> excellent new flags have been overloaded into this one system call. 
>> Some of those certainly violated the "does not change the semantics >> of the application" statement, but, sadly, the kernel developers who >> implemented MADV_REMOVE or MADV_DONTFORK did not think to send a >> patch to the man page for those new flags, one that might have noted >> that the semantics of the application are changed by such flags. Equally >> sadly, I did overlook to scan the bigger page when *I* added >> documentation of these flags to those pages, otherwise I might have >> caught that detail. >> >> So, just to repeat, I could really use more help on it from MM >> folk in the form of actual patches to the man page. > > Thanks for the background. I'll try to remember to check for man-pages part > when I review some api changing patch. That would be great. Thanks, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966765AbbBDRCJ (ORCPT ); Wed, 4 Feb 2015 12:02:09 -0500 Received: from cantor2.suse.de ([195.135.220.15]:59113 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755526AbbBDRCH (ORCPT ); Wed, 4 Feb 2015 12:02:07 -0500 Message-ID: <54D2508A.9030804@suse.cz> Date: Wed, 04 Feb 2015 18:02:02 +0100 From: Vlastimil Babka User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: mtk.manpages@gmail.com CC: "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins Subject: Re: MADV_DONTNEED semantics? 
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: > Hello Vlastimil, > > On 4 February 2015 at 14:46, Vlastimil Babka wrote: >>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" >>>> case >>>> though. I dont see any check for other kinds of shared pages in the code. >>> >>> Agreed. "shared" here seems confused. I've removed it. And I've >>> added mention of "Huge TLB pages" for this error. >> >> Thanks. > > I also added those cases for MADV_REMOVE, BTW. Right. There's also the following for MADV_REMOVE that needs updating: "Currently, only shmfs/tmpfs supports this; other filesystems return with the error ENOSYS." - it's not just shmem/tmpfs anymore. It should be best to refer to fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to date. - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is listed in the ERRORS section. >>>>>> - The word "will result" did sound as a guarantee at least to me. So >>>>>> here it >>>>>> could be changed to "may result (unless the advice is ignored)"? >>>>> >>>>> It's too late to fix documentation. Applications already depends on the >>>>> beheviour. >>>> >>>> Right, so as long as they check for EINVAL, it should be safe. It appears >>>> that >>>> jemalloc does. 
>>> >>> >>> So, first a brief question: in the cases where the call does not error >>> out, >>> are we agreed that in the current implementation, MADV_DONTNEED will >>> always result in zero-filled pages when the region is faulted back in >>> (when we consider pages that are not backed by a file)? >> >> >> I'd agree at this point. > > Thanks for the confirmation. > >> Also we should probably mention anonymously shared pages (shmem). I think >> they behave the same as file here. > > You mean tmpfs here, right? (I don't keep all of the synonyms straight.) shmem is tmpfs (that by itself would fit under "files" just fine), but also sys V segments created by shmget(2) and also mappings created by mmap with MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to refer to the full list. Thanks, Vlastimil From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161326AbbBDTYw (ORCPT ); Wed, 4 Feb 2015 14:24:52 -0500 Received: from mail-la0-f43.google.com ([209.85.215.43]:35307 "EHLO mail-la0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752723AbbBDTYt convert rfc822-to-8bit (ORCPT ); Wed, 4 Feb 2015 14:24:49 -0500 MIME-Version: 1.0 Reply-To: mtk.manpages@gmail.com In-Reply-To: <54D2508A.9030804@suse.cz> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> From: "Michael Kerrisk (man-pages)" Date: Wed, 4 Feb 2015 20:24:27 +0100 Message-ID: Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints To: Vlastimil Babka Cc: "Kirill A. 
Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 4 February 2015 at 18:02, Vlastimil Babka wrote: > On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: >> >> Hello Vlastimil, >> >> On 4 February 2015 at 14:46, Vlastimil Babka wrote: >>>>> >>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" >>>>> case >>>>> though. I dont see any check for other kinds of shared pages in the >>>>> code. >>>> >>>> >>>> Agreed. "shared" here seems confused. I've removed it. And I've >>>> added mention of "Huge TLB pages" for this error. >>> >>> >>> Thanks. >> >> >> I also added those cases for MADV_REMOVE, BTW. > > > Right. There's also the following for MADV_REMOVE that needs updating: > > "Currently, only shmfs/tmpfs supports this; other filesystems return with > the error ENOSYS." > > - it's not just shmem/tmpfs anymore. It should be best to refer to > fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to > date. > > - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is > listed in the ERRORS section. Yup, I recently added that as well, based on a patch from Jan Chaloupka. >>>>>>> - The word "will result" did sound as a guarantee at least to me. So >>>>>>> here it >>>>>>> could be changed to "may result (unless the advice is ignored)"? >>>>>> >>>>>> It's too late to fix documentation. Applications already depends on >>>>>> the >>>>>> beheviour. >>>>> >>>>> Right, so as long as they check for EINVAL, it should be safe. It >>>>> appears >>>>> that >>>>> jemalloc does. 
>>>> >>>> So, first a brief question: in the cases where the call does not error >>>> out, >>>> are we agreed that in the current implementation, MADV_DONTNEED will >>>> always result in zero-filled pages when the region is faulted back in >>>> (when we consider pages that are not backed by a file)? >>> >>> I'd agree at this point. >> >> Thanks for the confirmation. >> >>> Also we should probably mention anonymously shared pages (shmem). I think >>> they behave the same as file here. >> >> You mean tmpfs here, right? (I don't keep all of the synonyms straight.) > > shmem is tmpfs (that by itself would fit under "files" just fine), but also > sys V segments created by shmget(2) and also mappings created by mmap with > MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to > refer to the full list. So, how about this text: After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either reloading of the memory contents from the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings. Thanks, Michael From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756241AbbBEBIM (ORCPT ); Wed, 4 Feb 2015 20:08:12 -0500 Received: from mail-pa0-f54.google.com ([209.85.220.54]:52346 "EHLO mail-pa0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751283AbbBEBIJ (ORCPT ); Wed, 4 Feb 2015 20:08:09 -0500 Date: Thu, 5 Feb 2015 10:07:58 +0900 From: Minchan Kim To: "Michael Kerrisk (man-pages)" Cc: Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins Subject: Re: MADV_DONTNEED semantics?
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150205010757.GA20996@blaptop> References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote: > On 4 February 2015 at 18:02, Vlastimil Babka wrote: > > On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: > >> > >> Hello Vlastimil, > >> > >> On 4 February 2015 at 14:46, Vlastimil Babka wrote: > >>>>> > >>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" > >>>>> case > >>>>> though. I dont see any check for other kinds of shared pages in the > >>>>> code. > >>>> > >>>> > >>>> Agreed. "shared" here seems confused. I've removed it. And I've > >>>> added mention of "Huge TLB pages" for this error. > >>> > >>> > >>> Thanks. > >> > >> > >> I also added those cases for MADV_REMOVE, BTW. > > > > > > Right. There's also the following for MADV_REMOVE that needs updating: > > > > "Currently, only shmfs/tmpfs supports this; other filesystems return with > > the error ENOSYS." > > > > - it's not just shmem/tmpfs anymore. It should be best to refer to > > fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to > > date. > > > > - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is > > listed in the ERRORS section. > > Yup, I recently added that as well, based on a patch from Jan Chaloupka. > > >>>>>>> - The word "will result" did sound as a guarantee at least to me. 
So > >>>>>>> here it > >>>>>>> could be changed to "may result (unless the advice is ignored)"? > >>>>>> > >>>>>> It's too late to fix documentation. Applications already depends on > >>>>>> the > >>>>>> beheviour. > >>>>> > >>>>> Right, so as long as they check for EINVAL, it should be safe. It > >>>>> appears > >>>>> that > >>>>> jemalloc does. > >>>> > >>>> So, first a brief question: in the cases where the call does not error > >>>> out, > >>>> are we agreed that in the current implementation, MADV_DONTNEED will > >>>> always result in zero-filled pages when the region is faulted back in > >>>> (when we consider pages that are not backed by a file)? > >>> > >>> I'd agree at this point. > >> > >> Thanks for the confirmation. > >> > >>> Also we should probably mention anonymously shared pages (shmem). I think > >>> they behave the same as file here. > >> > >> You mean tmpfs here, right? (I don't keep all of the synonyms straight.) > > > > shmem is tmpfs (that by itself would fit under "files" just fine), but also > > sys V segments created by shmget(2) and also mappings created by mmap with > > MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to > > refer to the full list. > > So, how about this text: > > After a successful MADV_DONTNEED operation, the seman‐ > tics of memory access in the specified region are > changed: subsequent accesses of pages in the range > will succeed, but will result in either reloading of > the memory contents from the underlying mapped file > (for shared file mappings, shared anonymous mappings, > and shmem-based techniques such as System V shared > memory segments) or zero-fill-on-demand pages for > anonymous private mappings. Hmm, I'd like to clarify. Whether it was intention or not, some of userspace developers thought about that syscall drop pages instantly if was no-error return so that they will see more free pages(ie, rss for the process will be decreased) with keeping the VMA. Can we rely on it? 
And we should update the ERRORS section, too. "locked" covers mlock(2), and you said you will add hugetlb. Then what about VM_PFNMAP? In that case, the call fails as well. How should we describe VM_PFNMAP? As a special mapping for some drivers?

One more thing: "The kernel is free to ignore the advice" conflicts with "This call does not influence the semantics of the application (except in the case of MADV_DONTNEED)". So is it okay to read it as "The kernel is free to ignore the advice, except for MADV_DONTNEED"?

Thanks.

-- 
Kind regards,
Minchan Kim

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758127AbbBEPlK (ORCPT ); Thu, 5 Feb 2015 10:41:10 -0500
Received: from cantor2.suse.de ([195.135.220.15]:57062 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753763AbbBEPlG (ORCPT ); Thu, 5 Feb 2015 10:41:06 -0500
Date: Thu, 5 Feb 2015 16:41:02 +0100
From: Michal Hocko
To: "Michael Kerrisk (man-pages)"
Cc: Vlastimil Babka, "Kirill A. Shutemov", Dave Hansen, Mel Gorman, "linux-mm@kvack.org", Minchan Kim, Andrew Morton, lkml, Linux API, linux-man, Hugh Dickins
Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
Message-ID: <20150205154102.GA20607@dhcp22.suse.cz>
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 04-02-15 20:24:27, Michael Kerrisk wrote:
[...]
> So, how about this text: > > After a successful MADV_DONTNEED operation, the seman‐ > tics of memory access in the specified region are > changed: subsequent accesses of pages in the range > will succeed, but will result in either reloading of > the memory contents from the underlying mapped file " result in either providing the up-to-date contents of the underlying mapped file " Would be more precise IMO because reload might be interpreted as a major fault which is not necessarily the case (see below). > (for shared file mappings, shared anonymous mappings, > and shmem-based techniques such as System V shared > memory segments) or zero-fill-on-demand pages for > anonymous private mappings. Yes, this wording is better because many users are not aware of MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't mention that. I am just wondering whether it makes sense to mention that MADV_DONTNEED for shared mappings might be surprising and not freeing the backing pages thus not really freeing memory until there is a memory pressure. But maybe this is too implementation specific for a man page. What about the following wording on top of yours? " Please note that the MADV_DONTNEED hint on shared mappings might not lead to immediate freeing of pages in the range. The kernel is free to delay this until an appropriate moment. RSS of the calling process will be reduced however. 
" -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752215AbbBEVo4 (ORCPT ); Thu, 5 Feb 2015 16:44:56 -0500 Received: from mx1.redhat.com ([209.132.183.28]:48958 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751261AbbBEVoz (ORCPT ); Thu, 5 Feb 2015 16:44:55 -0500 Message-ID: <54D3E44B.7060501@redhat.com> Date: Thu, 05 Feb 2015 16:44:43 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Mel Gorman , Andrew Morton CC: linux-mm@kvack.org, Minchan Kim , Vlastimil Babka , linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <20150202140506.392ff6920743f19ea44cff59@linux-foundation.org> <20150202221824.GN2395@suse.de> In-Reply-To: <20150202221824.GN2395@suse.de> Content-Type: text/plain; charset=iso-8859-15 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/02/2015 05:18 PM, Mel Gorman wrote: > On Mon, Feb 02, 2015 at 02:05:06PM -0800, Andrew Morton wrote: >> On Mon, 2 Feb 2015 16:55:25 +0000 Mel Gorman wrote: >> >>> glibc malloc changed behaviour in glibc 2.10 to have per-thread arenas >>> instead of creating new areans if the existing ones were contended. >>> The decision appears to have been made so the allocator scales better but the >>> downside is that madvise(MADV_DONTNEED) is now called for these per-thread >>> areans during free. This tears down pages that would have previously >>> remained. There is nothing wrong with this decision from a functional point >>> of view but any threaded application that frequently allocates/frees the >>> same-sized region is going to incur the full teardown and refault costs. >> >> MADV_DONTNEED has been there for many years. 
How could this problem >> not have been noticed during glibc 2.10 development/testing? > > I do not know. I only spotted it due to switching distributions. Looping > allocations and frees of the same sizes is considered inefficient and it > might have been dismissed on those grounds. It's probably less noticeable > when it only affects threaded applications. > >> Is there >> some more recent kernel change which is triggering this? >> > > Not that I'm aware of. > >>> This patch identifies when a thread is frequently calling MADV_DONTNEED >>> on the same region of memory and starts ignoring the hint. >> >> That's pretty nasty-looking :( >> > > Yep, it is but we're very limited in terms of what we can do within the > kernel here. > >> And presumably there are all sorts of behaviours which will still >> trigger the problem but which will avoid the start/end equality test in >> ignore_madvise_hint()? >> > > Yes. I would expect that a simple pattern of multiple allocs followed by > multiple frees in a loop would also trigger it. > >> Really, this is a glibc problem and only a glibc problem. >> MADV_DONTNEED is unavoidably expensive and glibc is calling >> MADV_DONTNEED for a region which it *does* need. > > To be fair to glibc, it calls it on a region it *thinks* it doesn't need only > to reuse it immediately afterwards because of how the benchmark is > implemented. > >> Is there something >> preventing this from being addressed within glibc? > > I doubt it other than I expect they'll punt it back and blame either the > application for being stupid or the kernel for being slow. This sounds like something that could benefit from Minchan's MADV_FREE, instead of MADV_DONTNEED. If non page aligned malloc/free does not depend on pages being zeroed, I suspect an MADV_DONTNEED resulting from a malloc/free loop also does not depend on it. 
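The zero-fill behaviour that the thread keeps returning to can be checked directly. What follows is only a minimal sketch, assuming Linux and a private anonymous mapping; the helper name dontneed_zero_fills() is made up for illustration and does not appear anywhere in the discussion. It dirties one page, calls madvise(MADV_DONTNEED), and then checks whether the page reads back as zeros.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS and madvise() under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): returns 1 if a private
 * anonymous page reads back as zeros after a successful MADV_DONTNEED,
 * 0 if old data survived, and -1 if a system call failed.
 */
int dontneed_zero_fills(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);                 /* dirty the page */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* The next access faults the page back in; inspect what we get. */
    int zeroed = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            zeroed = 0;
            break;
        }
    }

    munmap(p, len);
    return zeroed;
}
```

A caller would treat a return of 1 as confirmation of the zero-fill-on-demand behaviour discussed for anonymous private mappings. The sketch says nothing about shared or file-backed mappings, where the thread expects the contents of the underlying object to reappear instead.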
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756327AbbBFPl0 (ORCPT ); Fri, 6 Feb 2015 10:41:26 -0500 Received: from mail-we0-f176.google.com ([74.125.82.176]:37042 "EHLO mail-we0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754337AbbBFPlX (ORCPT ); Fri, 6 Feb 2015 10:41:23 -0500 Message-ID: <54D4E098.8050004@gmail.com> Date: Fri, 06 Feb 2015 16:41:12 +0100 From: "Michael Kerrisk (man-pages)" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Minchan Kim CC: mtk.manpages@gmail.com, Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205010757.GA20996@blaptop> In-Reply-To: <20150205010757.GA20996@blaptop> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/05/2015 02:07 AM, Minchan Kim wrote: > Hello, > > On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote: >> On 4 February 2015 at 18:02, Vlastimil Babka wrote: >>> On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: >>>> >>>> Hello Vlastimil, >>>> >>>> On 4 February 2015 at 14:46, Vlastimil Babka wrote: >>>>>>> >>>>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" >>>>>>> case >>>>>>> though. I dont see any check for other kinds of shared pages in the >>>>>>> code. >>>>>> >>>>>> >>>>>> Agreed. 
"shared" here seems confused. I've removed it. And I've >>>>>> added mention of "Huge TLB pages" for this error. >>>>> >>>>> >>>>> Thanks. >>>> >>>> >>>> I also added those cases for MADV_REMOVE, BTW. >>> >>> >>> Right. There's also the following for MADV_REMOVE that needs updating: >>> >>> "Currently, only shmfs/tmpfs supports this; other filesystems return with >>> the error ENOSYS." >>> >>> - it's not just shmem/tmpfs anymore. It should be best to refer to >>> fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to >>> date. >>> >>> - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. Also neither error code is >>> listed in the ERRORS section. >> >> Yup, I recently added that as well, based on a patch from Jan Chaloupka. >> >>>>>>>>> - The word "will result" did sound as a guarantee at least to me. So >>>>>>>>> here it >>>>>>>>> could be changed to "may result (unless the advice is ignored)"? >>>>>>>> >>>>>>>> It's too late to fix documentation. Applications already depends on >>>>>>>> the >>>>>>>> beheviour. >>>>>>> >>>>>>> Right, so as long as they check for EINVAL, it should be safe. It >>>>>>> appears >>>>>>> that >>>>>>> jemalloc does. >>>>>> >>>>>> So, first a brief question: in the cases where the call does not error >>>>>> out, >>>>>> are we agreed that in the current implementation, MADV_DONTNEED will >>>>>> always result in zero-filled pages when the region is faulted back in >>>>>> (when we consider pages that are not backed by a file)? >>>>> >>>>> I'd agree at this point. >>>> >>>> Thanks for the confirmation. >>>> >>>>> Also we should probably mention anonymously shared pages (shmem). I think >>>>> they behave the same as file here. >>>> >>>> You mean tmpfs here, right? (I don't keep all of the synonyms straight.) >>> >>> shmem is tmpfs (that by itself would fit under "files" just fine), but also >>> sys V segments created by shmget(2) and also mappings created by mmap with >>> MAP_SHARED | MAP_ANONYMOUS. 
I'm not sure if there's a single manpage to >>> refer to the full list. >> >> So, how about this text: >> >> After a successful MADV_DONTNEED operation, the seman‐ >> tics of memory access in the specified region are >> changed: subsequent accesses of pages in the range >> will succeed, but will result in either reloading of >> the memory contents from the underlying mapped file >> (for shared file mappings, shared anonymous mappings, >> and shmem-based techniques such as System V shared >> memory segments) or zero-fill-on-demand pages for >> anonymous private mappings. > > Hmm, I'd like to clarify. > > Whether it was intention or not, some of userspace developers thought > about that syscall drop pages instantly if was no-error return so that > they will see more free pages(ie, rss for the process will be decreased) > with keeping the VMA. Can we rely on it? I do not know. Michael? > And we should make error section, too. > "locked" covers mlock(2) and you said you will add hugetlb. Then, > VM_PFNMAP? In that case, it fails. How can we say about VM_PFNMAP? > special mapping for some drivers? I'm open for offers on what to add. > One more thing, "The kernel is free to ignore the advice". > It conflicts "This call does not influence the semantics of the > application (except in the case of MADV_DONTNEED)" so > is it okay we can believe "The kernel is free to ingmore the advise > except MADV_DONTNEED"? I decided to just drop the sentence The kernel is free to ignore the advice. It creates misunderstandings, and does not really add information. 
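For the ERRORS discussion, the locked-pages case can be probed from user space. This is a hedged sketch, assuming Linux and that the process may mlock(2) a single page; the helper name dontneed_on_locked_page() is invented for illustration. It reports which errno, if any, madvise(MADV_DONTNEED) sets on an mlocked private anonymous page. According to the implementation details quoted earlier in the thread, that should be EINVAL.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS and madvise() under strict -std modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): returns the errno set by
 * madvise(MADV_DONTNEED) on an mlock()ed page, 0 if the call
 * unexpectedly succeeded, or -1 if the setup (mmap or mlock) failed.
 */
int dontneed_on_locked_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (mlock(p, len) != 0) {     /* can fail under a low RLIMIT_MEMLOCK */
        munmap(p, len);
        return -1;
    }

    int result = 0;
    if (madvise(p, len, MADV_DONTNEED) != 0)
        result = errno;           /* save errno before further calls */

    munlock(p, len);
    munmap(p, len);
    return result;
}
```

The hugetlb and VM_PFNMAP cases mentioned above need special mappings to set up and are not covered by this sketch.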
Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755963AbbBFP6D (ORCPT ); Fri, 6 Feb 2015 10:58:03 -0500
Received: from mail-wi0-f177.google.com ([209.85.212.177]:40385 "EHLO mail-wi0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755130AbbBFP6A (ORCPT ); Fri, 6 Feb 2015 10:58:00 -0500
Message-ID: <54D4E47E.4020509@gmail.com>
Date: Fri, 06 Feb 2015 16:57:50 +0100
From: "Michael Kerrisk (man-pages)"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: Michal Hocko
CC: mtk.manpages@gmail.com, Vlastimil Babka, "Kirill A. Shutemov", Dave Hansen, Mel Gorman, "linux-mm@kvack.org", Minchan Kim, Andrew Morton, lkml, Linux API, linux-man, Hugh Dickins
Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
References: <20150202165525.GM2395@suse.de> <54CFF8AC.6010102@intel.com> <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205154102.GA20607@dhcp22.suse.cz>
In-Reply-To: <20150205154102.GA20607@dhcp22.suse.cz>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Michal,

On 02/05/2015 04:41 PM, Michal Hocko wrote:
> On Wed 04-02-15 20:24:27, Michael Kerrisk wrote:
> [...]
>> So, how about this text: >> >> After a successful MADV_DONTNEED operation, the seman‐ >> tics of memory access in the specified region are >> changed: subsequent accesses of pages in the range >> will succeed, but will result in either reloading of >> the memory contents from the underlying mapped file > > " > result in either providing the up-to-date contents of the underlying > mapped file > " Thanks! I did something like that. See below. > Would be more precise IMO because reload might be interpreted as a major > fault which is not necessarily the case (see below). > >> (for shared file mappings, shared anonymous mappings, >> and shmem-based techniques such as System V shared >> memory segments) or zero-fill-on-demand pages for >> anonymous private mappings. > > Yes, this wording is better because many users are not aware of > MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't > mention that. (Michal, would you have a text to propose to add to the mmap(2) page? Maybe it would be useful to add something there.) > > I am just wondering whether it makes sense to mention that MADV_DONTNEED > for shared mappings might be surprising and not freeing the backing > pages thus not really freeing memory until there is a memory > pressure. But maybe this is too implementation specific for a man > page. What about the following wording on top of yours? > " > Please note that the MADV_DONTNEED hint on shared mappings might not > lead to immediate freeing of pages in the range. The kernel is free to > delay this until an appropriate moment. RSS of the calling process will > be reduced however. > " Thanks! I added this, but dropped in the word "immediately" in the last sentence, since I assume that was implied. 
So now we have: After a successful MADV_DONTNEED operation, the seman‐ tics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the mem‐ ory contents from the up-to-date contents of the under‐ lying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings. Note that, when applied to shared mappings, MADV_DONT‐ NEED might not lead to immediate freeing of the pages in the range. The kernel is free to delay freeing the pages until an appropriate moment. The resident set size (RSS) of the calling process will be immediately reduced however. The current draft of the page can be found in a branch, http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise Thanks, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755268AbbBFUpm (ORCPT ); Fri, 6 Feb 2015 15:45:42 -0500 Received: from mail-wg0-f52.google.com ([74.125.82.52]:56458 "EHLO mail-wg0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754165AbbBFUpk (ORCPT ); Fri, 6 Feb 2015 15:45:40 -0500 Date: Fri, 6 Feb 2015 21:45:36 +0100 From: Michal Hocko To: "Michael Kerrisk (man-pages)" Cc: Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Minchan Kim , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins Subject: Re: MADV_DONTNEED semantics? 
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150206204536.GA24245@dhcp22.suse.cz> References: <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205154102.GA20607@dhcp22.suse.cz> <54D4E47E.4020509@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <54D4E47E.4020509@gmail.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 06-02-15 16:57:50, Michael Kerrisk wrote: [...] > > Yes, this wording is better because many users are not aware of > > MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't > > mention that. > > (Michal, would you have a text to propose to add to the mmap(2) page? > Maybe it would be useful to add something there.) I am half way on vacation, but I can cook a patch after I am back after week. > > I am just wondering whether it makes sense to mention that MADV_DONTNEED > > for shared mappings might be surprising and not freeing the backing > > pages thus not really freeing memory until there is a memory > > pressure. But maybe this is too implementation specific for a man > > page. What about the following wording on top of yours? > > " > > Please note that the MADV_DONTNEED hint on shared mappings might not > > lead to immediate freeing of pages in the range. The kernel is free to > > delay this until an appropriate moment. RSS of the calling process will > > be reduced however. > > " > > Thanks! I added this, but dropped in the word "immediately" in the last > sentence, since I assume that was implied. 
So now we have: > > After a successful MADV_DONTNEED operation, the seman‐ > tics of memory access in the specified region are > changed: subsequent accesses of pages in the range will > succeed, but will result in either repopulating the mem‐ > ory contents from the up-to-date contents of the under‐ > lying mapped file (for shared file mappings, shared > anonymous mappings, and shmem-based techniques such as > System V shared memory segments) or zero-fill-on-demand > pages for anonymous private mappings. > > Note that, when applied to shared mappings, MADV_DONT‐ > NEED might not lead to immediate freeing of the pages in > the range. The kernel is free to delay freeing the > pages until an appropriate moment. The resident set > size (RSS) of the calling process will be immediately > reduced however. This sounds good to me and it is definitely much better than the current state. Thanks! > The current draft of the page can be found in a branch, > http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise > > Thanks, > > Michael > > > > -- > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/ -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759452AbbBIGqO (ORCPT ); Mon, 9 Feb 2015 01:46:14 -0500 Received: from mail-pa0-f46.google.com ([209.85.220.46]:48594 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751660AbbBIGqL (ORCPT ); Mon, 9 Feb 2015 01:46:11 -0500 Date: Mon, 9 Feb 2015 15:46:00 +0900 From: Minchan Kim To: "Michael Kerrisk (man-pages)" Cc: Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins Subject: Re: MADV_DONTNEED semantics? 
Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150209064600.GA32300@blaptop> References: <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205010757.GA20996@blaptop> <54D4E098.8050004@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <54D4E098.8050004@gmail.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Michael On Fri, Feb 06, 2015 at 04:41:12PM +0100, Michael Kerrisk (man-pages) wrote: > On 02/05/2015 02:07 AM, Minchan Kim wrote: > > Hello, > > > > On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote: > >> On 4 February 2015 at 18:02, Vlastimil Babka wrote: > >>> On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote: > >>>> > >>>> Hello Vlastimil, > >>>> > >>>> On 4 February 2015 at 14:46, Vlastimil Babka wrote: > >>>>>>> > >>>>>>> - that covers mlocking ok, not sure if the rest fits the "shared pages" > >>>>>>> case > >>>>>>> though. I dont see any check for other kinds of shared pages in the > >>>>>>> code. > >>>>>> > >>>>>> > >>>>>> Agreed. "shared" here seems confused. I've removed it. And I've > >>>>>> added mention of "Huge TLB pages" for this error. > >>>>> > >>>>> > >>>>> Thanks. > >>>> > >>>> > >>>> I also added those cases for MADV_REMOVE, BTW. > >>> > >>> > >>> Right. There's also the following for MADV_REMOVE that needs updating: > >>> > >>> "Currently, only shmfs/tmpfs supports this; other filesystems return with > >>> the error ENOSYS." > >>> > >>> - it's not just shmem/tmpfs anymore. It should be best to refer to > >>> fallocate(2) option FALLOC_FL_PUNCH_HOLE which seems to be (more) up to > >>> date. > >>> > >>> - AFAICS it doesn't return ENOSYS but EOPNOTSUPP. 
Also neither error code is > >>> listed in the ERRORS section. > >> > >> Yup, I recently added that as well, based on a patch from Jan Chaloupka. > >> > >>>>>>>>> - The word "will result" did sound as a guarantee at least to me. So > >>>>>>>>> here it > >>>>>>>>> could be changed to "may result (unless the advice is ignored)"? > >>>>>>>> > >>>>>>>> It's too late to fix documentation. Applications already depends on > >>>>>>>> the > >>>>>>>> beheviour. > >>>>>>> > >>>>>>> Right, so as long as they check for EINVAL, it should be safe. It > >>>>>>> appears > >>>>>>> that > >>>>>>> jemalloc does. > >>>>>> > >>>>>> So, first a brief question: in the cases where the call does not error > >>>>>> out, > >>>>>> are we agreed that in the current implementation, MADV_DONTNEED will > >>>>>> always result in zero-filled pages when the region is faulted back in > >>>>>> (when we consider pages that are not backed by a file)? > >>>>> > >>>>> I'd agree at this point. > >>>> > >>>> Thanks for the confirmation. > >>>> > >>>>> Also we should probably mention anonymously shared pages (shmem). I think > >>>>> they behave the same as file here. > >>>> > >>>> You mean tmpfs here, right? (I don't keep all of the synonyms straight.) > >>> > >>> shmem is tmpfs (that by itself would fit under "files" just fine), but also > >>> sys V segments created by shmget(2) and also mappings created by mmap with > >>> MAP_SHARED | MAP_ANONYMOUS. I'm not sure if there's a single manpage to > >>> refer to the full list. 
> >>
> >> So, how about this text:
> >>
> >> After a successful MADV_DONTNEED operation, the seman‐
> >> tics of memory access in the specified region are
> >> changed: subsequent accesses of pages in the range
> >> will succeed, but will result in either reloading of
> >> the memory contents from the underlying mapped file
> >> (for shared file mappings, shared anonymous mappings,
> >> and shmem-based techniques such as System V shared
> >> memory segments) or zero-fill-on-demand pages for
> >> anonymous private mappings.
> >
> > Hmm, I'd like to clarify.
> >
> > Whether it was intention or not, some of userspace developers thought
> > about that syscall drop pages instantly if was no-error return so that
> > they will see more free pages(ie, rss for the process will be decreased)
> > with keeping the VMA. Can we rely on it?
>
> I do not know. Michael?

It's important to identify the difference between MADV_DONTNEED and MADV_FREE, so it would be better to clear this up while we have the chance.

> >
> > And we should make error section, too.
> > "locked" covers mlock(2) and you said you will add hugetlb. Then,
> > VM_PFNMAP? In that case, it fails. How can we say about VM_PFNMAP?
> > special mapping for some drivers?
>
> I'm open for offers on what to add.

I suggest quoting LWN (http://lwn.net/Articles/162860/):

"*special mapping* which is not made up of "normal" pages. It is usually created by device drivers which map special memory areas into user space"

> >
> > One more thing, "The kernel is free to ignore the advice".
> > It conflicts "This call does not influence the semantics of the
> > application (except in the case of MADV_DONTNEED)" so
> > is it okay we can believe "The kernel is free to ignore the advice
> > except MADV_DONTNEED"?
>
> I decided to just drop the sentence
>
> The kernel is free to ignore the advice.
>
> It creates misunderstandings, and does not really add information.

Sounds good.
> > Cheers, > > Michael > > -- > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/ -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759440AbbBIGu7 (ORCPT ); Mon, 9 Feb 2015 01:50:59 -0500 Received: from mail-pd0-f174.google.com ([209.85.192.174]:36053 "EHLO mail-pd0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754175AbbBIGu4 (ORCPT ); Mon, 9 Feb 2015 01:50:56 -0500 Date: Mon, 9 Feb 2015 15:50:45 +0900 From: Minchan Kim To: "Michael Kerrisk (man-pages)" Cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Dave Hansen , Mel Gorman , "linux-mm@kvack.org" , Andrew Morton , lkml , Linux API , linux-man , Hugh Dickins Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints Message-ID: <20150209065045.GB32300@blaptop> References: <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi> <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com> <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz> <20150205154102.GA20607@dhcp22.suse.cz> <54D4E47E.4020509@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <54D4E47E.4020509@gmail.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Feb 06, 2015 at 04:57:50PM +0100, Michael Kerrisk (man-pages) wrote: > Hi Michael > > On 02/05/2015 04:41 PM, Michal Hocko wrote: > > On Wed 04-02-15 20:24:27, Michael Kerrisk wrote: > > [...] 
> >> So, how about this text:
> >>
> >>     After a successful MADV_DONTNEED operation, the seman‐
> >>     tics of memory access in the specified region are
> >>     changed: subsequent accesses of pages in the range
> >>     will succeed, but will result in either reloading of
> >>     the memory contents from the underlying mapped file
> >
> > "
> > result in either providing the up-to-date contents of the underlying
> > mapped file
> > "
>
> Thanks! I did something like that. See below.
>
> > Would be more precise IMO because reload might be interpreted as a major
> > fault which is not necessarily the case (see below).
> >
> >>     (for shared file mappings, shared anonymous mappings,
> >>     and shmem-based techniques such as System V shared
> >>     memory segments) or zero-fill-on-demand pages for
> >>     anonymous private mappings.
> >
> > Yes, this wording is better because many users are not aware of
> > MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't
> > mention that.
>
> (Michal, would you have a text to propose to add to the mmap(2) page?
> Maybe it would be useful to add something there.)
>
> >
> > I am just wondering whether it makes sense to mention that MADV_DONTNEED
> > for shared mappings might be surprising and not freeing the backing
> > pages thus not really freeing memory until there is a memory
> > pressure. But maybe this is too implementation specific for a man
> > page. What about the following wording on top of yours?
> > "
> > Please note that the MADV_DONTNEED hint on shared mappings might not
> > lead to immediate freeing of pages in the range. The kernel is free to
> > delay this until an appropriate moment. RSS of the calling process will
> > be reduced however.
> > "
>
> Thanks! I added this, but dropped in the word "immediately" in the last
> sentence, since I assume that was implied.
> So now we have:
>
>     After a successful MADV_DONTNEED operation, the seman‐
>     tics of memory access in the specified region are
>     changed: subsequent accesses of pages in the range will
>     succeed, but will result in either repopulating the mem‐
>     ory contents from the up-to-date contents of the under‐
>     lying mapped file (for shared file mappings, shared
>     anonymous mappings, and shmem-based techniques such as
>     System V shared memory segments) or zero-fill-on-demand
>     pages for anonymous private mappings.
>
>     Note that, when applied to shared mappings, MADV_DONT‐
>     NEED might not lead to immediate freeing of the pages in
>     the range. The kernel is free to delay freeing the
>     pages until an appropriate moment. The resident set
>     size (RSS) of the calling process will be immediately
>     reduced however.

Looks good.

So, I read this as saying that, for anonymous private mappings,
MADV_DONTNEED leads to immediate freeing of the pages in the range,
so it is clearly different from MADV_FREE.

>
> The current draft of the page can be found in a branch,
> http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise
>
> Thanks,
>
> Michael
>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

--
Kind regards,
Minchan Kim

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S932685AbbBIJNW (ORCPT ); Mon, 9 Feb 2015 04:13:22 -0500
Received: from mail-wi0-f170.google.com ([209.85.212.170]:59192 "EHLO
    mail-wi0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
    with ESMTP id S932495AbbBIJNK (ORCPT ); Mon, 9 Feb 2015 04:13:10 -0500
Message-ID: <54D87A23.40703@gmail.com>
Date: Mon, 09 Feb 2015 10:13:07 +0100
From: "Michael Kerrisk (man-pages)"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: Minchan Kim
CC: mtk.manpages@gmail.com, Vlastimil
    Babka, "Kirill A. Shutemov", Dave Hansen, Mel Gorman,
    "linux-mm@kvack.org", Andrew Morton, lkml, Linux API, linux-man,
    Hugh Dickins
Subject: Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
References: <54D08483.40209@suse.cz> <20150203105301.GC14259@node.dhcp.inet.fi>
    <54D0B43D.8000209@suse.cz> <54D0F56A.9050003@gmail.com>
    <54D22298.3040504@suse.cz> <54D2508A.9030804@suse.cz>
    <20150205010757.GA20996@blaptop> <54D4E098.8050004@gmail.com>
    <20150209064600.GA32300@blaptop>
In-Reply-To: <20150209064600.GA32300@blaptop>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hello Minchan

On 02/09/2015 07:46 AM, Minchan Kim wrote:
> Hello, Michael
>
> On Fri, Feb 06, 2015 at 04:41:12PM +0100, Michael Kerrisk (man-pages) wrote:
>> On 02/05/2015 02:07 AM, Minchan Kim wrote:
>>> Hello,
>>>
>>> On Wed, Feb 04, 2015 at 08:24:27PM +0100, Michael Kerrisk (man-pages) wrote:
>>>> On 4 February 2015 at 18:02, Vlastimil Babka wrote:
>>>>> On 02/04/2015 03:00 PM, Michael Kerrisk (man-pages) wrote:

[...]

>>> And we should write an error section, too.
>>> "locked" covers mlock(2) and you said you will add hugetlb. Then,
>>> VM_PFNMAP? In that case, it fails. How should we describe VM_PFNMAP?
>>> A special mapping for some drivers?
>>
>> I'm open for offers on what to add.
>
> I suggest quoting LWN, http://lwn.net/Articles/162860/:
> "*special mapping* which is not made up of "normal" pages.
> It is usually created by device drivers which map special memory areas
> into user space"

Thanks. I've added mention of VM_PFNMAP in the discussion of both
MADV_DONTNEED and MADV_REMOVE, and noted that both of those operations
will give an error when applied to VM_PFNMAP pages.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/