* get_zone_device_page() in get_page() and page_cache_get_speculative()
  [not found] <CAA9_cmf7=aGXKoQFkzS_UJtznfRtWofitDpV2AyGwpaRGKyQkg@mail.gmail.com>
@ 2017-04-23 23:31 ` Kirill A. Shutemov
  2017-04-24 17:23 ` Dan Williams
  2017-04-27  0:55 ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Dan Williams
  0 siblings, 2 replies; 43+ messages in thread
From: Kirill A. Shutemov @ 2017-04-23 23:31 UTC (permalink / raw)
To: Dan Williams, linux-mm
Cc: Catalin Marinas, aneesh.kumar, steve.capper, Thomas Gleixner,
    Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar, Andrew Morton,
    Kirill A. Shutemov, H. Peter Anvin, dave.hansen, Borislav Petkov,
    Rik van Riel, dann.frazier, Linus Torvalds, Michal Hocko,
    linux-tip-commits

On Thu, Apr 20, 2017 at 02:46:51PM -0700, Dan Williams wrote:
> On Sat, Mar 18, 2017 at 2:52 AM, tip-bot for Kirill A. Shutemov
> <tipbot@zytor.com> wrote:
> > Commit-ID: 2947ba054a4dabbd82848728d765346886050029
> > Gitweb: http://git.kernel.org/tip/2947ba054a4dabbd82848728d765346886050029
> > Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > AuthorDate: Fri, 17 Mar 2017 00:39:06 +0300
> > Committer: Ingo Molnar <mingo@kernel.org>
> > CommitDate: Sat, 18 Mar 2017 09:48:03 +0100
> >
> > x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
> >
> > This patch provides all required callbacks required by the generic
> > get_user_pages_fast() code and switches x86 over - and removes
> > the platform specific implementation.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Dann Frazier <dann.frazier@canonical.com>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Steve Capper <steve.capper@linaro.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: linux-arch@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com
> > [ Minor readability edits. ]
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
>
> I'm still trying to spot the bug, but bisect points to this patch as
> the point at which my unit tests start failing with the following
> signature:
>
> [ 35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155
> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200

Okay, I've tracked it down. The issue is triggered by the replacement of
get_page() with page_cache_get_speculative().

page_cache_get_speculative() doesn't have get_zone_device_page(). :-|

And I think it's your bug, Dan: it's wrong to have
get_/put_zone_device_page() in get_/put_page(). It must be handled by the
page_ref_* machinery to catch all cases where we manipulate the page
refcount.

Back to the big picture:

I hate that we need to have such additional code in the page refcount
primitives. I worked hard to remove the compound page ugliness from there,
and now zone_device is creeping in...

Is it the only option?

--
 Kirill A. Shutemov
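For reference, the two paths being compared look roughly like this in a
v4.11-era tree (simplified sketch: the real page_cache_get_speculative() in
include/linux/pagemap.h also carries TINY_RCU/preemption handling and debug
checks that are omitted here):

/* include/linux/mm.h before the change discussed below */
static inline void get_page(struct page *page)
{
        VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
        page_ref_inc(page);

        /* extra hook: pins the owning dev_pagemap's percpu_ref */
        if (unlikely(is_zone_device_page(page)))
                get_zone_device_page(page);
}

/* include/linux/pagemap.h, used by the generic GUP-fast path */
static inline int page_cache_get_speculative(struct page *page)
{
        if (unlikely(!get_page_unless_zero(page)))
                return 0;

        /*
         * No is_zone_device_page() hook here, so the pgmap percpu_ref is
         * never elevated -- the imbalance behind the WARN at
         * lib/percpu-refcount.c:155 quoted above.
         */
        return 1;
}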
* Re: get_zone_device_page() in get_page() and page_cache_get_speculative() 2017-04-23 23:31 ` get_zone_device_page() in get_page() and page_cache_get_speculative() Kirill A. Shutemov @ 2017-04-24 17:23 ` Dan Williams 2017-04-24 17:30 ` Kirill A. Shutemov 2017-04-27 0:55 ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Dan Williams 1 sibling, 1 reply; 43+ messages in thread From: Dan Williams @ 2017-04-24 17:23 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar, Andrew Morton, Kirill A. Shutemov, H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel, Dann Frazier, Linus Torvalds, Michal Hocko, linux-tip-commits On Sun, Apr 23, 2017 at 4:31 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > On Thu, Apr 20, 2017 at 02:46:51PM -0700, Dan Williams wrote: >> On Sat, Mar 18, 2017 at 2:52 AM, tip-bot for Kirill A. Shutemov >> <tipbot@zytor.com> wrote: >> > Commit-ID: 2947ba054a4dabbd82848728d765346886050029 >> > Gitweb: http://git.kernel.org/tip/2947ba054a4dabbd82848728d765346886050029 >> > Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >> > AuthorDate: Fri, 17 Mar 2017 00:39:06 +0300 >> > Committer: Ingo Molnar <mingo@kernel.org> >> > CommitDate: Sat, 18 Mar 2017 09:48:03 +0100 >> > >> > x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation >> > >> > This patch provides all required callbacks required by the generic >> > get_user_pages_fast() code and switches x86 over - and removes >> > the platform specific implementation. >> > >> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >> > Cc: Andrew Morton <akpm@linux-foundation.org> >> > Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com> >> > Cc: Borislav Petkov <bp@alien8.de> >> > Cc: Catalin Marinas <catalin.marinas@arm.com> >> > Cc: Dann Frazier <dann.frazier@canonical.com> >> > Cc: Dave Hansen <dave.hansen@intel.com> >> > Cc: H. Peter Anvin <hpa@zytor.com> >> > Cc: Linus Torvalds <torvalds@linux-foundation.org> >> > Cc: Peter Zijlstra <peterz@infradead.org> >> > Cc: Rik van Riel <riel@redhat.com> >> > Cc: Steve Capper <steve.capper@linaro.org> >> > Cc: Thomas Gleixner <tglx@linutronix.de> >> > Cc: linux-arch@vger.kernel.org >> > Cc: linux-mm@kvack.org >> > Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com >> > [ Minor readability edits. ] >> > Signed-off-by: Ingo Molnar <mingo@kernel.org> >> >> I'm still trying to spot the bug, but bisect points to this patch as >> the point at which my unit tests start failing with the following >> signature: >> >> [ 35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 >> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200 > > Okay, I've tracked it down. The issue is triggered by replacment > get_page() with page_cache_get_speculative(). > > page_cache_get_speculative() doesn't have get_zone_device_page(). :-| > > And I think it's your bug, Dan: it's wrong to have > get_/put_zone_device_page() in get_/put_page(). I must be handled by > page_ref_* machinery to catch all cases where we manipulate with page > refcount. The page_ref conversion landed in 4.6 *after* the ZONE_DEVICE implementation that landed in 4.5, so there was a missed conversion of the zone-device reference counting to page_ref. > Back to the big picture: > > I hate that we need to have such additional code in page refcount > primitives. 
I worked hard to remove compound page ugliness from there and > now zone_device creeping in... > > Is it the only option? Not sure, I need to spend some time to understand what page_ref means to ZONE_DEVICE. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: get_zone_device_page() in get_page() and page_cache_get_speculative() 2017-04-24 17:23 ` Dan Williams @ 2017-04-24 17:30 ` Kirill A. Shutemov 2017-04-24 17:47 ` Dan Williams 0 siblings, 1 reply; 43+ messages in thread From: Kirill A. Shutemov @ 2017-04-24 17:30 UTC (permalink / raw) To: Dan Williams Cc: Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar, Andrew Morton, Kirill A. Shutemov, H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel, Dann Frazier, Linus Torvalds, Michal Hocko, linux-tip-commits On Mon, Apr 24, 2017 at 10:23:59AM -0700, Dan Williams wrote: > On Sun, Apr 23, 2017 at 4:31 PM, Kirill A. Shutemov > <kirill@shutemov.name> wrote: > > On Thu, Apr 20, 2017 at 02:46:51PM -0700, Dan Williams wrote: > >> On Sat, Mar 18, 2017 at 2:52 AM, tip-bot for Kirill A. Shutemov > >> <tipbot@zytor.com> wrote: > >> > Commit-ID: 2947ba054a4dabbd82848728d765346886050029 > >> > Gitweb: http://git.kernel.org/tip/2947ba054a4dabbd82848728d765346886050029 > >> > Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > >> > AuthorDate: Fri, 17 Mar 2017 00:39:06 +0300 > >> > Committer: Ingo Molnar <mingo@kernel.org> > >> > CommitDate: Sat, 18 Mar 2017 09:48:03 +0100 > >> > > >> > x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation > >> > > >> > This patch provides all required callbacks required by the generic > >> > get_user_pages_fast() code and switches x86 over - and removes > >> > the platform specific implementation. > >> > > >> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > >> > Cc: Andrew Morton <akpm@linux-foundation.org> > >> > Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com> > >> > Cc: Borislav Petkov <bp@alien8.de> > >> > Cc: Catalin Marinas <catalin.marinas@arm.com> > >> > Cc: Dann Frazier <dann.frazier@canonical.com> > >> > Cc: Dave Hansen <dave.hansen@intel.com> > >> > Cc: H. Peter Anvin <hpa@zytor.com> > >> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > >> > Cc: Peter Zijlstra <peterz@infradead.org> > >> > Cc: Rik van Riel <riel@redhat.com> > >> > Cc: Steve Capper <steve.capper@linaro.org> > >> > Cc: Thomas Gleixner <tglx@linutronix.de> > >> > Cc: linux-arch@vger.kernel.org > >> > Cc: linux-mm@kvack.org > >> > Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com > >> > [ Minor readability edits. ] > >> > Signed-off-by: Ingo Molnar <mingo@kernel.org> > >> > >> I'm still trying to spot the bug, but bisect points to this patch as > >> the point at which my unit tests start failing with the following > >> signature: > >> > >> [ 35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 > >> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200 > > > > Okay, I've tracked it down. The issue is triggered by replacment > > get_page() with page_cache_get_speculative(). > > > > page_cache_get_speculative() doesn't have get_zone_device_page(). :-| > > > > And I think it's your bug, Dan: it's wrong to have > > get_/put_zone_device_page() in get_/put_page(). I must be handled by > > page_ref_* machinery to catch all cases where we manipulate with page > > refcount. > > The page_ref conversion landed in 4.6 *after* the ZONE_DEVICE > implementation that landed in 4.5, so there was a missed conversion of > the zone-device reference counting to page_ref. Fair enough. But get_page_unless_zero() definitely predates ZONE_DEVICE. :) -- Kirill A. 
Shutemov
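For reference, get_page_unless_zero() (v4.11-era include/linux/mm.h) goes
straight to the page_ref_* layer, with no ZONE_DEVICE awareness of its own:

static inline int get_page_unless_zero(struct page *page)
{
        /* atomic_add_unless() on page->_refcount, nothing device-specific */
        return page_ref_add_unless(page, 1, 0);
}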
* Re: get_zone_device_page() in get_page() and page_cache_get_speculative() 2017-04-24 17:30 ` Kirill A. Shutemov @ 2017-04-24 17:47 ` Dan Williams 2017-04-24 18:01 ` Kirill A. Shutemov 0 siblings, 1 reply; 43+ messages in thread From: Dan Williams @ 2017-04-24 17:47 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar, Andrew Morton, Kirill A. Shutemov, H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel, Dann Frazier, Linus Torvalds, Michal Hocko, linux-tip-commits On Mon, Apr 24, 2017 at 10:30 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > On Mon, Apr 24, 2017 at 10:23:59AM -0700, Dan Williams wrote: >> On Sun, Apr 23, 2017 at 4:31 PM, Kirill A. Shutemov >> <kirill@shutemov.name> wrote: >> > On Thu, Apr 20, 2017 at 02:46:51PM -0700, Dan Williams wrote: >> >> On Sat, Mar 18, 2017 at 2:52 AM, tip-bot for Kirill A. Shutemov >> >> <tipbot@zytor.com> wrote: >> >> > Commit-ID: 2947ba054a4dabbd82848728d765346886050029 >> >> > Gitweb: http://git.kernel.org/tip/2947ba054a4dabbd82848728d765346886050029 >> >> > Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >> >> > AuthorDate: Fri, 17 Mar 2017 00:39:06 +0300 >> >> > Committer: Ingo Molnar <mingo@kernel.org> >> >> > CommitDate: Sat, 18 Mar 2017 09:48:03 +0100 >> >> > >> >> > x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation >> >> > >> >> > This patch provides all required callbacks required by the generic >> >> > get_user_pages_fast() code and switches x86 over - and removes >> >> > the platform specific implementation. >> >> > >> >> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >> >> > Cc: Andrew Morton <akpm@linux-foundation.org> >> >> > Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com> >> >> > Cc: Borislav Petkov <bp@alien8.de> >> >> > Cc: Catalin Marinas <catalin.marinas@arm.com> >> >> > Cc: Dann Frazier <dann.frazier@canonical.com> >> >> > Cc: Dave Hansen <dave.hansen@intel.com> >> >> > Cc: H. Peter Anvin <hpa@zytor.com> >> >> > Cc: Linus Torvalds <torvalds@linux-foundation.org> >> >> > Cc: Peter Zijlstra <peterz@infradead.org> >> >> > Cc: Rik van Riel <riel@redhat.com> >> >> > Cc: Steve Capper <steve.capper@linaro.org> >> >> > Cc: Thomas Gleixner <tglx@linutronix.de> >> >> > Cc: linux-arch@vger.kernel.org >> >> > Cc: linux-mm@kvack.org >> >> > Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com >> >> > [ Minor readability edits. ] >> >> > Signed-off-by: Ingo Molnar <mingo@kernel.org> >> >> >> >> I'm still trying to spot the bug, but bisect points to this patch as >> >> the point at which my unit tests start failing with the following >> >> signature: >> >> >> >> [ 35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 >> >> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200 >> > >> > Okay, I've tracked it down. The issue is triggered by replacment >> > get_page() with page_cache_get_speculative(). >> > >> > page_cache_get_speculative() doesn't have get_zone_device_page(). :-| >> > >> > And I think it's your bug, Dan: it's wrong to have >> > get_/put_zone_device_page() in get_/put_page(). I must be handled by >> > page_ref_* machinery to catch all cases where we manipulate with page >> > refcount. >> >> The page_ref conversion landed in 4.6 *after* the ZONE_DEVICE >> implementation that landed in 4.5, so there was a missed conversion of >> the zone-device reference counting to page_ref. 
> > Fair enough. > > But get_page_unless_zero() definitely predates ZONE_DEVICE. :) > It does, but that's deliberate. A ZONE_DEVICE page never has a zero reference count, it's always owned by the device, never by the page allocator. ZONE_DEVICE overrides the ->lru list_head to store private device information and we rely on the behavior that a non-zero reference means the page is not added to any lru or page cache list. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
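The invariant Dan describes comes from the per-page setup loop in
devm_memremap_pages(), roughly as it appears in v4.11-era kernel/memremap.c
(the patch later in this thread adds a percpu_ref_get(ref) to this same
loop):

        for_each_device_pfn(pfn, page_map) {
                struct page *page = pfn_to_page(pfn);

                /*
                 * ZONE_DEVICE pages are never handed to the page allocator
                 * or placed on an LRU, so the ->lru union is reused as a
                 * ->pgmap back-pointer to the owning dev_pagemap.
                 */
                list_del(&page->lru);
                page->pgmap = pgmap;
        }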
* Re: get_zone_device_page() in get_page() and page_cache_get_speculative()
  2017-04-24 17:47 ` Dan Williams
@ 2017-04-24 18:01 ` Kirill A. Shutemov
  2017-04-24 18:25 ` Kirill A. Shutemov
  0 siblings, 1 reply; 43+ messages in thread
From: Kirill A. Shutemov @ 2017-04-24 18:01 UTC (permalink / raw)
To: Dan Williams
Cc: Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper,
    Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar,
    Andrew Morton, Kirill A. Shutemov, H. Peter Anvin, Dave Hansen,
    Borislav Petkov, Rik van Riel, Dann Frazier, Linus Torvalds,
    Michal Hocko, linux-tip-commits

On Mon, Apr 24, 2017 at 10:47:43AM -0700, Dan Williams wrote:
> On Mon, Apr 24, 2017 at 10:30 AM, Kirill A. Shutemov
> >> >> [ 35.423841] WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155
> >> >> percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
> >> >
> >> > Okay, I've tracked it down. The issue is triggered by replacment
> >> > get_page() with page_cache_get_speculative().
> >> >
> >> > page_cache_get_speculative() doesn't have get_zone_device_page(). :-|
> >> >
> >> > And I think it's your bug, Dan: it's wrong to have
> >> > get_/put_zone_device_page() in get_/put_page(). I must be handled by
> >> > page_ref_* machinery to catch all cases where we manipulate with page
> >> > refcount.
> >>
> >> The page_ref conversion landed in 4.6 *after* the ZONE_DEVICE
> >> implementation that landed in 4.5, so there was a missed conversion of
> >> the zone-device reference counting to page_ref.
> >
> > Fair enough.
> >
> > But get_page_unless_zero() definitely predates ZONE_DEVICE. :)
> >
>
> It does, but that's deliberate. A ZONE_DEVICE page never has a zero
> reference count, it's always owned by the device, never by the page
> allocator. ZONE_DEVICE overrides the ->lru list_head to store private
> device information and we rely on the behavior that a non-zero
> reference means the page is not added to any lru or page cache list.

So, what do you propose? Use get_page() instead of
page_cache_get_speculative() in GUP_fast() if the page belongs to
ZONE_DEVICE?

I don't like it. This situation, where only a subset of the helpers can be
used to manipulate the page refcount, is just waiting to explode.

I think it's still better to do it at the page_ref_* level.

BTW, why do we need to pin the pgmap from get_page() in the first place?
I don't have enough background in ZONE_DEVICE.

--
 Kirill A. Shutemov
* Re: get_zone_device_page() in get_page() and page_cache_get_speculative() 2017-04-24 18:01 ` Kirill A. Shutemov @ 2017-04-24 18:25 ` Kirill A. Shutemov 2017-04-24 18:41 ` Dan Williams 0 siblings, 1 reply; 43+ messages in thread From: Kirill A. Shutemov @ 2017-04-24 18:25 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Dan Williams, Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar, Andrew Morton, H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel, Dann Frazier, Linus Torvalds, Michal Hocko, linux-tip-commits On Mon, Apr 24, 2017 at 09:01:58PM +0300, Kirill A. Shutemov wrote: > On Mon, Apr 24, 2017 at 10:47:43AM -0700, Dan Williams wrote: > I think it's still better to do it on page_ref_* level. Something like patch below? What do you think? diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 93416196ba64..bd1b13af4567 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -35,20 +35,6 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start) } #endif -/** - * struct dev_pagemap - metadata for ZONE_DEVICE mappings - * @altmap: pre-allocated/reserved memory for vmemmap allocations - * @res: physical address range covered by @ref - * @ref: reference count that pins the devm_memremap_pages() mapping - * @dev: host device of the mapping for debug - */ -struct dev_pagemap { - struct vmem_altmap *altmap; - const struct resource *res; - struct percpu_ref *ref; - struct device *dev; -}; - #ifdef CONFIG_ZONE_DEVICE void *devm_memremap_pages(struct device *dev, struct resource *res, struct percpu_ref *ref, struct vmem_altmap *altmap); diff --git a/include/linux/mm.h b/include/linux/mm.h index e197d3ca3e8a..c2749b878199 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -760,19 +760,11 @@ static inline enum zone_type page_zonenum(const struct page *page) } #ifdef CONFIG_ZONE_DEVICE -void get_zone_device_page(struct page *page); -void put_zone_device_page(struct page *page); static inline bool is_zone_device_page(const struct page *page) { return page_zonenum(page) == ZONE_DEVICE; } #else -static inline void get_zone_device_page(struct page *page) -{ -} -static inline void put_zone_device_page(struct page *page) -{ -} static inline bool is_zone_device_page(const struct page *page) { return false; @@ -788,9 +780,6 @@ static inline void get_page(struct page *page) */ VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page); page_ref_inc(page); - - if (unlikely(is_zone_device_page(page))) - get_zone_device_page(page); } static inline void put_page(struct page *page) @@ -799,9 +788,6 @@ static inline void put_page(struct page *page) if (put_page_testzero(page)) __put_page(page); - - if (unlikely(is_zone_device_page(page))) - put_zone_device_page(page); } #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 45cdb27791a3..fb7bb60d446b 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -601,4 +601,18 @@ typedef struct { unsigned long val; } swp_entry_t; +/** + * struct dev_pagemap - metadata for ZONE_DEVICE mappings + * @altmap: pre-allocated/reserved memory for vmemmap allocations + * @res: physical address range covered by @ref + * @ref: reference count that pins the devm_memremap_pages() mapping + * @dev: host device of the mapping for debug + */ +struct dev_pagemap { + struct vmem_altmap *altmap; + const struct resource *res; + struct percpu_ref *ref; + 
struct device *dev; +}; + #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h index 610e13271918..d834c68e21fd 100644 --- a/include/linux/page_ref.h +++ b/include/linux/page_ref.h @@ -61,6 +61,8 @@ static inline void __page_ref_unfreeze(struct page *page, int v) #endif +static inline bool is_zone_device_page(const struct page *page); + static inline int page_ref_count(struct page *page) { return atomic_read(&page->_refcount); @@ -92,6 +94,9 @@ static inline void page_ref_add(struct page *page, int nr) atomic_add(nr, &page->_refcount); if (page_ref_tracepoint_active(__tracepoint_page_ref_mod)) __page_ref_mod(page, nr); + + if (unlikely(is_zone_device_page(page))) + percpu_ref_get_many(page->pgmap->ref, nr); } static inline void page_ref_sub(struct page *page, int nr) @@ -99,6 +104,9 @@ static inline void page_ref_sub(struct page *page, int nr) atomic_sub(nr, &page->_refcount); if (page_ref_tracepoint_active(__tracepoint_page_ref_mod)) __page_ref_mod(page, -nr); + + if (unlikely(is_zone_device_page(page))) + percpu_ref_put_many(page->pgmap->ref, nr); } static inline void page_ref_inc(struct page *page) @@ -106,6 +114,9 @@ static inline void page_ref_inc(struct page *page) atomic_inc(&page->_refcount); if (page_ref_tracepoint_active(__tracepoint_page_ref_mod)) __page_ref_mod(page, 1); + + if (unlikely(is_zone_device_page(page))) + percpu_ref_get(page->pgmap->ref); } static inline void page_ref_dec(struct page *page) @@ -113,6 +124,9 @@ static inline void page_ref_dec(struct page *page) atomic_dec(&page->_refcount); if (page_ref_tracepoint_active(__tracepoint_page_ref_mod)) __page_ref_mod(page, -1); + + if (unlikely(is_zone_device_page(page))) + percpu_ref_put(page->pgmap->ref); } static inline int page_ref_sub_and_test(struct page *page, int nr) @@ -121,6 +135,9 @@ static inline int page_ref_sub_and_test(struct page *page, int nr) if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_test)) __page_ref_mod_and_test(page, -nr, ret); + + if (unlikely(is_zone_device_page(page))) + percpu_ref_put_many(page->pgmap->ref, nr); return ret; } @@ -130,6 +147,9 @@ static inline int page_ref_inc_return(struct page *page) if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_return)) __page_ref_mod_and_return(page, 1, ret); + + if (unlikely(is_zone_device_page(page))) + percpu_ref_get(page->pgmap->ref); return ret; } @@ -139,6 +159,9 @@ static inline int page_ref_dec_and_test(struct page *page) if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_test)) __page_ref_mod_and_test(page, -1, ret); + + if (unlikely(is_zone_device_page(page))) + percpu_ref_put(page->pgmap->ref); return ret; } @@ -148,6 +171,9 @@ static inline int page_ref_dec_return(struct page *page) if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_return)) __page_ref_mod_and_return(page, -1, ret); + + if (unlikely(is_zone_device_page(page))) + percpu_ref_put(page->pgmap->ref); return ret; } @@ -157,6 +183,9 @@ static inline int page_ref_add_unless(struct page *page, int nr, int u) if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_unless)) __page_ref_mod_unless(page, nr, ret); + + if (unlikely(is_zone_device_page(page)) && ret) + percpu_ref_get_many(page->pgmap->ref, nr); return ret; } @@ -166,6 +195,9 @@ static inline int page_ref_freeze(struct page *page, int count) if (page_ref_tracepoint_active(__tracepoint_page_ref_freeze)) __page_ref_freeze(page, count, ret); + + if (unlikely(is_zone_device_page(page)) && ret) + percpu_ref_put_many(page->pgmap->ref, 
count); return ret; } @@ -177,6 +209,9 @@ static inline void page_ref_unfreeze(struct page *page, int count) atomic_set(&page->_refcount, count); if (page_ref_tracepoint_active(__tracepoint_page_ref_unfreeze)) __page_ref_unfreeze(page, count); + + if (unlikely(is_zone_device_page(page))) + percpu_ref_get_many(page->pgmap->ref, count); } #endif diff --git a/kernel/memremap.c b/kernel/memremap.c index 06123234f118..936cef79d811 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -182,18 +182,6 @@ struct page_map { struct vmem_altmap altmap; }; -void get_zone_device_page(struct page *page) -{ - percpu_ref_get(page->pgmap->ref); -} -EXPORT_SYMBOL(get_zone_device_page); - -void put_zone_device_page(struct page *page) -{ - put_dev_pagemap(page->pgmap); -} -EXPORT_SYMBOL(put_zone_device_page); - static void pgmap_radix_release(struct resource *res) { resource_size_t key, align_start, align_size, align_end; -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: get_zone_device_page() in get_page() and page_cache_get_speculative() 2017-04-24 18:25 ` Kirill A. Shutemov @ 2017-04-24 18:41 ` Dan Williams 2017-04-25 13:19 ` Kirill A. Shutemov 0 siblings, 1 reply; 43+ messages in thread From: Dan Williams @ 2017-04-24 18:41 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Kirill A. Shutemov, Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar, Andrew Morton, H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel, Dann Frazier, Linus Torvalds, Michal Hocko, linux-tip-commits On Mon, Apr 24, 2017 at 11:25 AM, Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote: > On Mon, Apr 24, 2017 at 09:01:58PM +0300, Kirill A. Shutemov wrote: >> On Mon, Apr 24, 2017 at 10:47:43AM -0700, Dan Williams wrote: >> I think it's still better to do it on page_ref_* level. > > Something like patch below? What do you think? >From a quick glance, I think this looks like the right way to go. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: get_zone_device_page() in get_page() and page_cache_get_speculative() 2017-04-24 18:41 ` Dan Williams @ 2017-04-25 13:19 ` Kirill A. Shutemov 2017-04-25 16:44 ` Dan Williams 0 siblings, 1 reply; 43+ messages in thread From: Kirill A. Shutemov @ 2017-04-25 13:19 UTC (permalink / raw) To: Dan Williams Cc: Kirill A. Shutemov, Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar, Andrew Morton, H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel, Dann Frazier, Linus Torvalds, Michal Hocko, linux-tip-commits On Mon, Apr 24, 2017 at 11:41:51AM -0700, Dan Williams wrote: > On Mon, Apr 24, 2017 at 11:25 AM, Kirill A. Shutemov > <kirill.shutemov@linux.intel.com> wrote: > > On Mon, Apr 24, 2017 at 09:01:58PM +0300, Kirill A. Shutemov wrote: > >> On Mon, Apr 24, 2017 at 10:47:43AM -0700, Dan Williams wrote: > >> I think it's still better to do it on page_ref_* level. > > > > Something like patch below? What do you think? > > From a quick glance, I think this looks like the right way to go. Okay, but I still would like to remove manipulation with pgmap->ref from hot path. Can we just check that page_count() match our expectation on devm_memremap_pages_release() instead of this? I probably miss something in bigger picture, but would something like patch work too? It seems work for the test case. diff --git a/include/linux/mm.h b/include/linux/mm.h index a835edd2db34..695da2a19b4c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct page *page) } #ifdef CONFIG_ZONE_DEVICE -void get_zone_device_page(struct page *page); -void put_zone_device_page(struct page *page); static inline bool is_zone_device_page(const struct page *page) { return page_zonenum(page) == ZONE_DEVICE; } #else -static inline void get_zone_device_page(struct page *page) -{ -} -static inline void put_zone_device_page(struct page *page) -{ -} static inline bool is_zone_device_page(const struct page *page) { return false; @@ -790,9 +782,6 @@ static inline void get_page(struct page *page) */ VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page); page_ref_inc(page); - - if (unlikely(is_zone_device_page(page))) - get_zone_device_page(page); } static inline void put_page(struct page *page) @@ -801,9 +790,6 @@ static inline void put_page(struct page *page) if (put_page_testzero(page)) __put_page(page); - - if (unlikely(is_zone_device_page(page))) - put_zone_device_page(page); } #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) diff --git a/kernel/memremap.c b/kernel/memremap.c index 07e85e5229da..e542bb2f7ab0 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -182,18 +182,6 @@ struct page_map { struct vmem_altmap altmap; }; -void get_zone_device_page(struct page *page) -{ - percpu_ref_get(page->pgmap->ref); -} -EXPORT_SYMBOL(get_zone_device_page); - -void put_zone_device_page(struct page *page) -{ - put_dev_pagemap(page->pgmap); -} -EXPORT_SYMBOL(put_zone_device_page); - static void pgmap_radix_release(struct resource *res) { resource_size_t key, align_start, align_size, align_end; @@ -237,12 +225,21 @@ static void devm_memremap_pages_release(struct device *dev, void *data) struct resource *res = &page_map->res; resource_size_t align_start, align_size; struct dev_pagemap *pgmap = &page_map->pgmap; + unsigned long pfn; if (percpu_ref_tryget_live(pgmap->ref)) { dev_WARN(dev, "%s: page mapping is still live!\n", __func__); percpu_ref_put(pgmap->ref); } + 
for_each_device_pfn(pfn, page_map) { + struct page *page = pfn_to_page(pfn); + + dev_WARN_ONCE(dev, page_count(page) != 1, + "%s: unexpected page count: %d!\n", + __func__, page_count(page)); + } + /* pages are dead and unused, undo the arch mapping */ align_start = res->start & ~(SECTION_SIZE - 1); align_size = ALIGN(resource_size(res), SECTION_SIZE); -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: get_zone_device_page() in get_page() and page_cache_get_speculative() 2017-04-25 13:19 ` Kirill A. Shutemov @ 2017-04-25 16:44 ` Dan Williams 0 siblings, 0 replies; 43+ messages in thread From: Dan Williams @ 2017-04-25 16:44 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Kirill A. Shutemov, Linux MM, Catalin Marinas, Aneesh Kumar K.V, Steve Capper, Thomas Gleixner, Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar, Andrew Morton, H. Peter Anvin, Dave Hansen, Borislav Petkov, Rik van Riel, Dann Frazier, Linus Torvalds, Michal Hocko, linux-tip-commits On Tue, Apr 25, 2017 at 6:19 AM, Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote: > On Mon, Apr 24, 2017 at 11:41:51AM -0700, Dan Williams wrote: >> On Mon, Apr 24, 2017 at 11:25 AM, Kirill A. Shutemov >> <kirill.shutemov@linux.intel.com> wrote: >> > On Mon, Apr 24, 2017 at 09:01:58PM +0300, Kirill A. Shutemov wrote: >> >> On Mon, Apr 24, 2017 at 10:47:43AM -0700, Dan Williams wrote: >> >> I think it's still better to do it on page_ref_* level. >> > >> > Something like patch below? What do you think? >> >> From a quick glance, I think this looks like the right way to go. > > Okay, but I still would like to remove manipulation with pgmap->ref from > hot path. > > Can we just check that page_count() match our expectation on > devm_memremap_pages_release() instead of this? > > I probably miss something in bigger picture, but would something like > patch work too? It seems work for the test case. No, unfortunately this is broken. It should be perfectly legal to start the driver shutdown process while page references are still outstanding. We use the percpu-ref infrastructure to wait for those references to be dropped. With the approach below we'll just race and crash. > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index a835edd2db34..695da2a19b4c 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct page *page) > } > > #ifdef CONFIG_ZONE_DEVICE > -void get_zone_device_page(struct page *page); > -void put_zone_device_page(struct page *page); > static inline bool is_zone_device_page(const struct page *page) > { > return page_zonenum(page) == ZONE_DEVICE; > } > #else > -static inline void get_zone_device_page(struct page *page) > -{ > -} > -static inline void put_zone_device_page(struct page *page) > -{ > -} > static inline bool is_zone_device_page(const struct page *page) > { > return false; > @@ -790,9 +782,6 @@ static inline void get_page(struct page *page) > */ > VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page); > page_ref_inc(page); > - > - if (unlikely(is_zone_device_page(page))) > - get_zone_device_page(page); > } > > static inline void put_page(struct page *page) > @@ -801,9 +790,6 @@ static inline void put_page(struct page *page) > > if (put_page_testzero(page)) > __put_page(page); > - > - if (unlikely(is_zone_device_page(page))) > - put_zone_device_page(page); > } > > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > diff --git a/kernel/memremap.c b/kernel/memremap.c > index 07e85e5229da..e542bb2f7ab0 100644 > --- a/kernel/memremap.c > +++ b/kernel/memremap.c > @@ -182,18 +182,6 @@ struct page_map { > struct vmem_altmap altmap; > }; > > -void get_zone_device_page(struct page *page) > -{ > - percpu_ref_get(page->pgmap->ref); > -} > -EXPORT_SYMBOL(get_zone_device_page); > - > -void put_zone_device_page(struct page *page) > -{ > - put_dev_pagemap(page->pgmap); > -} > -EXPORT_SYMBOL(put_zone_device_page); 
> - > static void pgmap_radix_release(struct resource *res) > { > resource_size_t key, align_start, align_size, align_end; > @@ -237,12 +225,21 @@ static void devm_memremap_pages_release(struct device *dev, void *data) > struct resource *res = &page_map->res; > resource_size_t align_start, align_size; > struct dev_pagemap *pgmap = &page_map->pgmap; > + unsigned long pfn; > > if (percpu_ref_tryget_live(pgmap->ref)) { > dev_WARN(dev, "%s: page mapping is still live!\n", __func__); > percpu_ref_put(pgmap->ref); > } > > + for_each_device_pfn(pfn, page_map) { > + struct page *page = pfn_to_page(pfn); > + > + dev_WARN_ONCE(dev, page_count(page) != 1, > + "%s: unexpected page count: %d!\n", > + __func__, page_count(page)); > + } > + > /* pages are dead and unused, undo the arch mapping */ > align_start = res->start & ~(SECTION_SIZE - 1); > align_size = ALIGN(resource_size(res), SECTION_SIZE); > -- > Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
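The ordering Dan relies on is what the percpu_ref teardown provides;
modelled on the drivers/dax/pmem.c devm actions as they look with the patch
below applied (debug prints dropped):

static void dax_pmem_percpu_release(struct percpu_ref *ref)
{
        struct dax_pmem *dax_pmem = to_dax_pmem(ref);

        complete(&dax_pmem->cmp);               /* last reference is gone */
}

static void dax_pmem_percpu_kill(void *data)
{
        struct percpu_ref *ref = data;

        percpu_ref_kill(ref);                   /* refuse new references */
}

static void dax_pmem_percpu_exit(void *data)
{
        struct percpu_ref *ref = data;
        struct dax_pmem *dax_pmem = to_dax_pmem(ref);

        /*
         * Runs after devm_memremap_pages_release(): wait here for any
         * still-outstanding page references (e.g. pages pinned by
         * get_user_pages()) to drain before tearing the ref down.
         */
        wait_for_completion(&dax_pmem->cmp);
        percpu_ref_exit(ref);
}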
* [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-23 23:31 ` get_zone_device_page() in get_page() and page_cache_get_speculative() Kirill A. Shutemov 2017-04-24 17:23 ` Dan Williams @ 2017-04-27 0:55 ` Dan Williams 2017-04-27 8:33 ` Kirill A. Shutemov 2017-04-27 16:11 ` [PATCH] " Logan Gunthorpe 1 sibling, 2 replies; 43+ messages in thread From: Dan Williams @ 2017-04-27 0:55 UTC (permalink / raw) To: akpm Cc: linux-mm, Jérôme Glisse, Logan Gunthorpe, linux-kernel, Kirill Shutemov Kirill points out that the calls to {get,put}_dev_pagemap() can be removed from the mm fast path if we take a single get_dev_pagemap() reference to signify that the page is alive and use the final put of the page to drop that reference. This does require some care to make sure that any waits for the percpu_ref to drop to zero occur *after* devm_memremap_page_release(), since it now maintains its own elevated reference. Cc Ingo Molnar <mingo@redhat.com> Cc: JA(C)rA'me Glisse <jglisse@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Andrew Morton <akpm@linux-foundation.org> Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- This patch might fix the regression that we found with the conversion to generic get_user_pages_fast() in the x86/mm branch pending for 4.12 (commit 2947ba054a4d "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"). I'll test tomorrow, but in case someone can give it a try before I wake up, here's an early version. drivers/dax/pmem.c | 2 +- drivers/nvdimm/pmem.c | 13 +++++++++++-- include/linux/mm.h | 14 -------------- kernel/memremap.c | 22 +++++++++------------- mm/swap.c | 10 ++++++++++ 5 files changed, 31 insertions(+), 30 deletions(-) diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index d4ca19bd74eb..9f2a0b4fd801 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data) struct dax_pmem *dax_pmem = to_dax_pmem(ref); dev_dbg(dax_pmem->dev, "%s\n", __func__); + wait_for_completion(&dax_pmem->cmp); percpu_ref_exit(ref); } @@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data) dev_dbg(dax_pmem->dev, "%s\n", __func__); percpu_ref_kill(ref); - wait_for_completion(&dax_pmem->cmp); } static int dax_pmem_probe(struct device *dev) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 3b3dab73d741..6be0c1253fcd 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -25,6 +25,7 @@ #include <linux/badblocks.h> #include <linux/memremap.h> #include <linux/vmalloc.h> +#include <linux/blk-mq.h> #include <linux/pfn_t.h> #include <linux/slab.h> #include <linux/pmem.h> @@ -243,6 +244,11 @@ static void pmem_release_queue(void *q) blk_cleanup_queue(q); } +static void pmem_freeze_queue(void *q) +{ + blk_mq_freeze_queue_start(q); +} + static void pmem_release_disk(void *__pmem) { struct pmem_device *pmem = __pmem; @@ -301,6 +307,9 @@ static int pmem_attach_disk(struct device *dev, if (!q) return -ENOMEM; + if (devm_add_action_or_reset(dev, pmem_release_queue, q)) + return -ENOMEM; + pmem->pfn_flags = PFN_DEV; if (is_nd_pfn(dev)) { addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter, @@ -320,10 +329,10 @@ static int pmem_attach_disk(struct device *dev, pmem->size, ARCH_MEMREMAP_PMEM); /* - * At release time the queue must be dead before + * At release time the queue must be frozen before * devm_memremap_pages is unwound */ - if (devm_add_action_or_reset(dev, 
pmem_release_queue, q)) + if (devm_add_action_or_reset(dev, pmem_freeze_queue, q)) return -ENOMEM; if (IS_ERR(addr)) diff --git a/include/linux/mm.h b/include/linux/mm.h index 00a8fa7e366a..ce17b35257ac 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -758,19 +758,11 @@ static inline enum zone_type page_zonenum(const struct page *page) } #ifdef CONFIG_ZONE_DEVICE -void get_zone_device_page(struct page *page); -void put_zone_device_page(struct page *page); static inline bool is_zone_device_page(const struct page *page) { return page_zonenum(page) == ZONE_DEVICE; } #else -static inline void get_zone_device_page(struct page *page) -{ -} -static inline void put_zone_device_page(struct page *page) -{ -} static inline bool is_zone_device_page(const struct page *page) { return false; @@ -786,9 +778,6 @@ static inline void get_page(struct page *page) */ VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page); page_ref_inc(page); - - if (unlikely(is_zone_device_page(page))) - get_zone_device_page(page); } static inline void put_page(struct page *page) @@ -797,9 +786,6 @@ static inline void put_page(struct page *page) if (put_page_testzero(page)) __put_page(page); - - if (unlikely(is_zone_device_page(page))) - put_zone_device_page(page); } #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) diff --git a/kernel/memremap.c b/kernel/memremap.c index 07e85e5229da..5316efdde083 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -182,18 +182,6 @@ struct page_map { struct vmem_altmap altmap; }; -void get_zone_device_page(struct page *page) -{ - percpu_ref_get(page->pgmap->ref); -} -EXPORT_SYMBOL(get_zone_device_page); - -void put_zone_device_page(struct page *page) -{ - put_dev_pagemap(page->pgmap); -} -EXPORT_SYMBOL(put_zone_device_page); - static void pgmap_radix_release(struct resource *res) { resource_size_t key, align_start, align_size, align_end; @@ -237,6 +225,10 @@ static void devm_memremap_pages_release(struct device *dev, void *data) struct resource *res = &page_map->res; resource_size_t align_start, align_size; struct dev_pagemap *pgmap = &page_map->pgmap; + unsigned long pfn; + + for_each_device_pfn(pfn, page_map) + put_page(pfn_to_page(pfn)); if (percpu_ref_tryget_live(pgmap->ref)) { dev_WARN(dev, "%s: page mapping is still live!\n", __func__); @@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys) * * Notes: * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time - * (or devm release event). + * (or devm release event). The expected order of events is that @ref has + * been through percpu_ref_kill() before devm_memremap_pages_release(). The + * wait for the completion of kill and percpu_ref_exit() must occur after + * devm_memremap_pages_release(). * * 2/ @res is expected to be a host memory range that could feasibly be * treated as a "System RAM" range, i.e. not a device mmio range, but @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res, */ list_del(&page->lru); page->pgmap = pgmap; + percpu_ref_get(ref); } devres_add(dev, page_map); return __va(res->start); diff --git a/mm/swap.c b/mm/swap.c index 5dabf444d724..01267dda6668 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page) void __put_page(struct page *page) { + if (is_zone_device_page(page)) { + put_dev_pagemap(page->pgmap); + + /* + * The page belong to device, do not return it to + * page allocator. 
+ */ + return; + } + if (unlikely(PageCompound(page))) __put_compound_page(page); else -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-27 0:55 ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Dan Williams @ 2017-04-27 8:33 ` Kirill A. Shutemov 2017-04-28 6:39 ` Ingo Molnar 2017-04-27 16:11 ` [PATCH] " Logan Gunthorpe 1 sibling, 1 reply; 43+ messages in thread From: Kirill A. Shutemov @ 2017-04-27 8:33 UTC (permalink / raw) To: Dan Williams, Ingo Molnar Cc: akpm, linux-mm, Jérôme Glisse, Logan Gunthorpe, linux-kernel, Kirill Shutemov On Wed, Apr 26, 2017 at 05:55:31PM -0700, Dan Williams wrote: > Kirill points out that the calls to {get,put}_dev_pagemap() can be > removed from the mm fast path if we take a single get_dev_pagemap() > reference to signify that the page is alive and use the final put of the > page to drop that reference. > > This does require some care to make sure that any waits for the > percpu_ref to drop to zero occur *after* devm_memremap_page_release(), > since it now maintains its own elevated reference. > > Cc Ingo Molnar <mingo@redhat.com> > Cc: Jerome Glisse <jglisse@redhat.com> > Cc: Logan Gunthorpe <logang@deltatee.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > > This patch might fix the regression that we found with the conversion to > generic get_user_pages_fast() in the x86/mm branch pending for 4.12 > (commit 2947ba054a4d "x86/mm/gup: Switch GUP to the generic > get_user_page_fast() implementation"). I'll test tomorrow, but in case > someone can give it a try before I wake up, here's an early version. + Ingo. This works for me with and without GUP revert. Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > drivers/dax/pmem.c | 2 +- > drivers/nvdimm/pmem.c | 13 +++++++++++-- There's a trivial conflict in drivers/nvdimm/pmem.c when applied to tip/master. 
> include/linux/mm.h | 14 -------------- > kernel/memremap.c | 22 +++++++++------------- > mm/swap.c | 10 ++++++++++ > 5 files changed, 31 insertions(+), 30 deletions(-) > > diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c > index d4ca19bd74eb..9f2a0b4fd801 100644 > --- a/drivers/dax/pmem.c > +++ b/drivers/dax/pmem.c > @@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data) > struct dax_pmem *dax_pmem = to_dax_pmem(ref); > > dev_dbg(dax_pmem->dev, "%s\n", __func__); > + wait_for_completion(&dax_pmem->cmp); > percpu_ref_exit(ref); > } > > @@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data) > > dev_dbg(dax_pmem->dev, "%s\n", __func__); > percpu_ref_kill(ref); > - wait_for_completion(&dax_pmem->cmp); > } > > static int dax_pmem_probe(struct device *dev) > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 3b3dab73d741..6be0c1253fcd 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -25,6 +25,7 @@ > #include <linux/badblocks.h> > #include <linux/memremap.h> > #include <linux/vmalloc.h> > +#include <linux/blk-mq.h> > #include <linux/pfn_t.h> > #include <linux/slab.h> > #include <linux/pmem.h> > @@ -243,6 +244,11 @@ static void pmem_release_queue(void *q) > blk_cleanup_queue(q); > } > > +static void pmem_freeze_queue(void *q) > +{ > + blk_mq_freeze_queue_start(q); > +} > + > static void pmem_release_disk(void *__pmem) > { > struct pmem_device *pmem = __pmem; > @@ -301,6 +307,9 @@ static int pmem_attach_disk(struct device *dev, > if (!q) > return -ENOMEM; > > + if (devm_add_action_or_reset(dev, pmem_release_queue, q)) > + return -ENOMEM; > + > pmem->pfn_flags = PFN_DEV; > if (is_nd_pfn(dev)) { > addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter, > @@ -320,10 +329,10 @@ static int pmem_attach_disk(struct device *dev, > pmem->size, ARCH_MEMREMAP_PMEM); > > /* > - * At release time the queue must be dead before > + * At release time the queue must be frozen before > * devm_memremap_pages is unwound > */ > - if (devm_add_action_or_reset(dev, pmem_release_queue, q)) > + if (devm_add_action_or_reset(dev, pmem_freeze_queue, q)) > return -ENOMEM; > > if (IS_ERR(addr)) > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 00a8fa7e366a..ce17b35257ac 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -758,19 +758,11 @@ static inline enum zone_type page_zonenum(const struct page *page) > } > > #ifdef CONFIG_ZONE_DEVICE > -void get_zone_device_page(struct page *page); > -void put_zone_device_page(struct page *page); > static inline bool is_zone_device_page(const struct page *page) > { > return page_zonenum(page) == ZONE_DEVICE; > } > #else > -static inline void get_zone_device_page(struct page *page) > -{ > -} > -static inline void put_zone_device_page(struct page *page) > -{ > -} > static inline bool is_zone_device_page(const struct page *page) > { > return false; > @@ -786,9 +778,6 @@ static inline void get_page(struct page *page) > */ > VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page); > page_ref_inc(page); > - > - if (unlikely(is_zone_device_page(page))) > - get_zone_device_page(page); > } > > static inline void put_page(struct page *page) > @@ -797,9 +786,6 @@ static inline void put_page(struct page *page) > > if (put_page_testzero(page)) > __put_page(page); > - > - if (unlikely(is_zone_device_page(page))) > - put_zone_device_page(page); > } > > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > diff --git a/kernel/memremap.c b/kernel/memremap.c > index 07e85e5229da..5316efdde083 100644 > --- 
a/kernel/memremap.c > +++ b/kernel/memremap.c > @@ -182,18 +182,6 @@ struct page_map { > struct vmem_altmap altmap; > }; > > -void get_zone_device_page(struct page *page) > -{ > - percpu_ref_get(page->pgmap->ref); > -} > -EXPORT_SYMBOL(get_zone_device_page); > - > -void put_zone_device_page(struct page *page) > -{ > - put_dev_pagemap(page->pgmap); > -} > -EXPORT_SYMBOL(put_zone_device_page); > - > static void pgmap_radix_release(struct resource *res) > { > resource_size_t key, align_start, align_size, align_end; > @@ -237,6 +225,10 @@ static void devm_memremap_pages_release(struct device *dev, void *data) > struct resource *res = &page_map->res; > resource_size_t align_start, align_size; > struct dev_pagemap *pgmap = &page_map->pgmap; > + unsigned long pfn; > + > + for_each_device_pfn(pfn, page_map) > + put_page(pfn_to_page(pfn)); > > if (percpu_ref_tryget_live(pgmap->ref)) { > dev_WARN(dev, "%s: page mapping is still live!\n", __func__); > @@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys) > * > * Notes: > * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time > - * (or devm release event). > + * (or devm release event). The expected order of events is that @ref has > + * been through percpu_ref_kill() before devm_memremap_pages_release(). The > + * wait for the completion of kill and percpu_ref_exit() must occur after > + * devm_memremap_pages_release(). > * > * 2/ @res is expected to be a host memory range that could feasibly be > * treated as a "System RAM" range, i.e. not a device mmio range, but > @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res, > */ > list_del(&page->lru); > page->pgmap = pgmap; > + percpu_ref_get(ref); > } > devres_add(dev, page_map); > return __va(res->start); > diff --git a/mm/swap.c b/mm/swap.c > index 5dabf444d724..01267dda6668 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page) > > void __put_page(struct page *page) > { > + if (is_zone_device_page(page)) { > + put_dev_pagemap(page->pgmap); > + > + /* > + * The page belong to device, do not return it to > + * page allocator. > + */ > + return; > + } > + > if (unlikely(PageCompound(page))) > __put_compound_page(page); > else > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-27 8:33 ` Kirill A. Shutemov @ 2017-04-28 6:39 ` Ingo Molnar 2017-04-28 8:14 ` [PATCH] mm, zone_device: Replace " Kirill A. Shutemov 2017-04-28 17:23 ` [PATCH v2] mm, zone_device: replace " Dan Williams 0 siblings, 2 replies; 43+ messages in thread From: Ingo Molnar @ 2017-04-28 6:39 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Dan Williams, Ingo Molnar, akpm, linux-mm, Jérôme Glisse, Logan Gunthorpe, linux-kernel, Kirill Shutemov * Kirill A. Shutemov <kirill@shutemov.name> wrote: > On Wed, Apr 26, 2017 at 05:55:31PM -0700, Dan Williams wrote: > > Kirill points out that the calls to {get,put}_dev_pagemap() can be > > removed from the mm fast path if we take a single get_dev_pagemap() > > reference to signify that the page is alive and use the final put of the > > page to drop that reference. > > > > This does require some care to make sure that any waits for the > > percpu_ref to drop to zero occur *after* devm_memremap_page_release(), > > since it now maintains its own elevated reference. > > > > Cc Ingo Molnar <mingo@redhat.com> > > Cc: Jerome Glisse <jglisse@redhat.com> > > Cc: Logan Gunthorpe <logang@deltatee.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > > --- > > > > This patch might fix the regression that we found with the conversion to > > generic get_user_pages_fast() in the x86/mm branch pending for 4.12 > > (commit 2947ba054a4d "x86/mm/gup: Switch GUP to the generic > > get_user_page_fast() implementation"). I'll test tomorrow, but in case > > someone can give it a try before I wake up, here's an early version. > > + Ingo. > > This works for me with and without GUP revert. > > Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > > drivers/dax/pmem.c | 2 +- > > drivers/nvdimm/pmem.c | 13 +++++++++++-- > > There's a trivial conflict in drivers/nvdimm/pmem.c when applied to > tip/master. Ok, could someone please send a version either to Linus for v4.11, or a version against latest -tip so I can included it in x86/mm, so that x86/mm gets unbroken. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH] mm, zone_device: Replace {get, put}_zone_device_page() with a single reference 2017-04-28 6:39 ` Ingo Molnar @ 2017-04-28 8:14 ` Kirill A. Shutemov 2017-04-28 17:23 ` [PATCH v2] mm, zone_device: replace " Dan Williams 1 sibling, 0 replies; 43+ messages in thread From: Kirill A. Shutemov @ 2017-04-28 8:14 UTC (permalink / raw) To: Ingo Molnar, akpm Cc: Dan Williams, Jérôme Glisse, Logan Gunthorpe, linux-mm, linux-kernel, Kirill A . Shutemov From: Dan Williams <dan.j.williams@intel.com> Kirill points out that the calls to {get,put}_dev_pagemap() can be removed from the mm fast path if we take a single get_dev_pagemap() reference to signify that the page is alive and use the final put of the page to drop that reference. This does require some care to make sure that any waits for the percpu_ref to drop to zero occur *after* devm_memremap_page_release(), since it now maintains its own elevated reference. Cc Ingo Molnar <mingo@redhat.com> Cc: JA(C)rA'me Glisse <jglisse@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Andrew Morton <akpm@linux-foundation.org> Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> [kirill.shutemov@linux.intel.com: rebased to -tip] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> --- drivers/dax/pmem.c | 2 +- drivers/nvdimm/pmem.c | 13 +++++++++++-- include/linux/mm.h | 14 -------------- kernel/memremap.c | 22 +++++++++------------- mm/swap.c | 10 ++++++++++ 5 files changed, 31 insertions(+), 30 deletions(-) diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index 033f49b31fdc..cb0d742fa23f 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data) struct dax_pmem *dax_pmem = to_dax_pmem(ref); dev_dbg(dax_pmem->dev, "%s\n", __func__); + wait_for_completion(&dax_pmem->cmp); percpu_ref_exit(ref); } @@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data) dev_dbg(dax_pmem->dev, "%s\n", __func__); percpu_ref_kill(ref); - wait_for_completion(&dax_pmem->cmp); } static int dax_pmem_probe(struct device *dev) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 5b536be5a12e..fb7bbc79ac26 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -25,6 +25,7 @@ #include <linux/badblocks.h> #include <linux/memremap.h> #include <linux/vmalloc.h> +#include <linux/blk-mq.h> #include <linux/pfn_t.h> #include <linux/slab.h> #include <linux/pmem.h> @@ -231,6 +232,11 @@ static void pmem_release_queue(void *q) blk_cleanup_queue(q); } +static void pmem_freeze_queue(void *q) +{ + blk_mq_freeze_queue_start(q); +} + static void pmem_release_disk(void *disk) { del_gendisk(disk); @@ -284,6 +290,9 @@ static int pmem_attach_disk(struct device *dev, if (!q) return -ENOMEM; + if (devm_add_action_or_reset(dev, pmem_release_queue, q)) + return -ENOMEM; + pmem->pfn_flags = PFN_DEV; if (is_nd_pfn(dev)) { addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter, @@ -303,10 +312,10 @@ static int pmem_attach_disk(struct device *dev, pmem->size, ARCH_MEMREMAP_PMEM); /* - * At release time the queue must be dead before + * At release time the queue must be frozen before * devm_memremap_pages is unwound */ - if (devm_add_action_or_reset(dev, pmem_release_queue, q)) + if (devm_add_action_or_reset(dev, pmem_freeze_queue, q)) return -ENOMEM; if (IS_ERR(addr)) diff --git a/include/linux/mm.h b/include/linux/mm.h index a835edd2db34..695da2a19b4c 100644 --- 
a/include/linux/mm.h +++ b/include/linux/mm.h @@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct page *page) } #ifdef CONFIG_ZONE_DEVICE -void get_zone_device_page(struct page *page); -void put_zone_device_page(struct page *page); static inline bool is_zone_device_page(const struct page *page) { return page_zonenum(page) == ZONE_DEVICE; } #else -static inline void get_zone_device_page(struct page *page) -{ -} -static inline void put_zone_device_page(struct page *page) -{ -} static inline bool is_zone_device_page(const struct page *page) { return false; @@ -790,9 +782,6 @@ static inline void get_page(struct page *page) */ VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page); page_ref_inc(page); - - if (unlikely(is_zone_device_page(page))) - get_zone_device_page(page); } static inline void put_page(struct page *page) @@ -801,9 +790,6 @@ static inline void put_page(struct page *page) if (put_page_testzero(page)) __put_page(page); - - if (unlikely(is_zone_device_page(page))) - put_zone_device_page(page); } #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) diff --git a/kernel/memremap.c b/kernel/memremap.c index 07e85e5229da..5316efdde083 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -182,18 +182,6 @@ struct page_map { struct vmem_altmap altmap; }; -void get_zone_device_page(struct page *page) -{ - percpu_ref_get(page->pgmap->ref); -} -EXPORT_SYMBOL(get_zone_device_page); - -void put_zone_device_page(struct page *page) -{ - put_dev_pagemap(page->pgmap); -} -EXPORT_SYMBOL(put_zone_device_page); - static void pgmap_radix_release(struct resource *res) { resource_size_t key, align_start, align_size, align_end; @@ -237,6 +225,10 @@ static void devm_memremap_pages_release(struct device *dev, void *data) struct resource *res = &page_map->res; resource_size_t align_start, align_size; struct dev_pagemap *pgmap = &page_map->pgmap; + unsigned long pfn; + + for_each_device_pfn(pfn, page_map) + put_page(pfn_to_page(pfn)); if (percpu_ref_tryget_live(pgmap->ref)) { dev_WARN(dev, "%s: page mapping is still live!\n", __func__); @@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys) * * Notes: * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time - * (or devm release event). + * (or devm release event). The expected order of events is that @ref has + * been through percpu_ref_kill() before devm_memremap_pages_release(). The + * wait for the completion of kill and percpu_ref_exit() must occur after + * devm_memremap_pages_release(). * * 2/ @res is expected to be a host memory range that could feasibly be * treated as a "System RAM" range, i.e. not a device mmio range, but @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res, */ list_del(&page->lru); page->pgmap = pgmap; + percpu_ref_get(ref); } devres_add(dev, page_map); return __va(res->start); diff --git a/mm/swap.c b/mm/swap.c index 5dabf444d724..01267dda6668 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page) void __put_page(struct page *page) { + if (is_zone_device_page(page)) { + put_dev_pagemap(page->pgmap); + + /* + * The page belong to device, do not return it to + * page allocator. + */ + return; + } + if (unlikely(PageCompound(page))) __put_compound_page(page); else -- 2.11.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-28 6:39 ` Ingo Molnar 2017-04-28 8:14 ` [PATCH] mm, zone_device: Replace " Kirill A. Shutemov @ 2017-04-28 17:23 ` Dan Williams 2017-04-28 17:34 ` Jerome Glisse 2017-04-29 14:18 ` Ingo Molnar 1 sibling, 2 replies; 43+ messages in thread From: Dan Williams @ 2017-04-28 17:23 UTC (permalink / raw) To: mingo Cc: linux-kernel, linux-mm, Jérôme Glisse, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov Kirill points out that the calls to {get,put}_dev_pagemap() can be removed from the mm fast path if we take a single get_dev_pagemap() reference to signify that the page is alive and use the final put of the page to drop that reference. This does require some care to make sure that any waits for the percpu_ref to drop to zero occur *after* devm_memremap_page_release(), since it now maintains its own elevated reference. Cc: Ingo Molnar <mingo@redhat.com> Cc: JA(C)rA'me Glisse <jglisse@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- Changes in v2: * Rebased to tip/master * Clarified comment in __put_page * Clarified devm_memremap_pages() kernel doc about ordering of devm_memremap_pages_release() vs percpu_ref_kill() vs wait for percpu_ref to drop to zero. Ingo, I retested this with a revert of commit 6dd29b3df975 "Revert 'x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation'". It should be good to go through x86/mm. drivers/dax/pmem.c | 2 +- drivers/nvdimm/pmem.c | 13 +++++++++++-- include/linux/mm.h | 14 -------------- kernel/memremap.c | 22 +++++++++------------- mm/swap.c | 10 ++++++++++ 5 files changed, 31 insertions(+), 30 deletions(-) diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index 033f49b31fdc..cb0d742fa23f 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data) struct dax_pmem *dax_pmem = to_dax_pmem(ref); dev_dbg(dax_pmem->dev, "%s\n", __func__); + wait_for_completion(&dax_pmem->cmp); percpu_ref_exit(ref); } @@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data) dev_dbg(dax_pmem->dev, "%s\n", __func__); percpu_ref_kill(ref); - wait_for_completion(&dax_pmem->cmp); } static int dax_pmem_probe(struct device *dev) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 5b536be5a12e..fb7bbc79ac26 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -25,6 +25,7 @@ #include <linux/badblocks.h> #include <linux/memremap.h> #include <linux/vmalloc.h> +#include <linux/blk-mq.h> #include <linux/pfn_t.h> #include <linux/slab.h> #include <linux/pmem.h> @@ -231,6 +232,11 @@ static void pmem_release_queue(void *q) blk_cleanup_queue(q); } +static void pmem_freeze_queue(void *q) +{ + blk_mq_freeze_queue_start(q); +} + static void pmem_release_disk(void *disk) { del_gendisk(disk); @@ -284,6 +290,9 @@ static int pmem_attach_disk(struct device *dev, if (!q) return -ENOMEM; + if (devm_add_action_or_reset(dev, pmem_release_queue, q)) + return -ENOMEM; + pmem->pfn_flags = PFN_DEV; if (is_nd_pfn(dev)) { addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter, @@ -303,10 +312,10 @@ static int pmem_attach_disk(struct device *dev, pmem->size, ARCH_MEMREMAP_PMEM); /* - * At release time the queue must be dead before + * At release time the 
queue must be frozen before * devm_memremap_pages is unwound */ - if (devm_add_action_or_reset(dev, pmem_release_queue, q)) + if (devm_add_action_or_reset(dev, pmem_freeze_queue, q)) return -ENOMEM; if (IS_ERR(addr)) diff --git a/include/linux/mm.h b/include/linux/mm.h index a835edd2db34..695da2a19b4c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct page *page) } #ifdef CONFIG_ZONE_DEVICE -void get_zone_device_page(struct page *page); -void put_zone_device_page(struct page *page); static inline bool is_zone_device_page(const struct page *page) { return page_zonenum(page) == ZONE_DEVICE; } #else -static inline void get_zone_device_page(struct page *page) -{ -} -static inline void put_zone_device_page(struct page *page) -{ -} static inline bool is_zone_device_page(const struct page *page) { return false; @@ -790,9 +782,6 @@ static inline void get_page(struct page *page) */ VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page); page_ref_inc(page); - - if (unlikely(is_zone_device_page(page))) - get_zone_device_page(page); } static inline void put_page(struct page *page) @@ -801,9 +790,6 @@ static inline void put_page(struct page *page) if (put_page_testzero(page)) __put_page(page); - - if (unlikely(is_zone_device_page(page))) - put_zone_device_page(page); } #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) diff --git a/kernel/memremap.c b/kernel/memremap.c index 07e85e5229da..23a6483c3666 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -182,18 +182,6 @@ struct page_map { struct vmem_altmap altmap; }; -void get_zone_device_page(struct page *page) -{ - percpu_ref_get(page->pgmap->ref); -} -EXPORT_SYMBOL(get_zone_device_page); - -void put_zone_device_page(struct page *page) -{ - put_dev_pagemap(page->pgmap); -} -EXPORT_SYMBOL(put_zone_device_page); - static void pgmap_radix_release(struct resource *res) { resource_size_t key, align_start, align_size, align_end; @@ -237,6 +225,10 @@ static void devm_memremap_pages_release(struct device *dev, void *data) struct resource *res = &page_map->res; resource_size_t align_start, align_size; struct dev_pagemap *pgmap = &page_map->pgmap; + unsigned long pfn; + + for_each_device_pfn(pfn, page_map) + put_page(pfn_to_page(pfn)); if (percpu_ref_tryget_live(pgmap->ref)) { dev_WARN(dev, "%s: page mapping is still live!\n", __func__); @@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys) * * Notes: * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time - * (or devm release event). + * (or devm release event). The expected order of events is that @ref has + * been through percpu_ref_kill() before devm_memremap_pages_release(). The + * wait for the completion of all references being dropped and + * percpu_ref_exit() must occur after devm_memremap_pages_release(). * * 2/ @res is expected to be a host memory range that could feasibly be * treated as a "System RAM" range, i.e. 
not a device mmio range, but @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res, */ list_del(&page->lru); page->pgmap = pgmap; + percpu_ref_get(ref); } devres_add(dev, page_map); return __va(res->start); diff --git a/mm/swap.c b/mm/swap.c index 5dabf444d724..d8d9ee9e311a 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page) void __put_page(struct page *page) { + if (is_zone_device_page(page)) { + put_dev_pagemap(page->pgmap); + + /* + * The page belongs to the device that created pgmap. Do + * not return it to page allocator. + */ + return; + } + if (unlikely(PageCompound(page))) __put_compound_page(page); else -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 43+ messages in thread
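To make the ordering that both versions of the patch rely on easier to follow, here is a minimal sketch of how a driver using devm_memremap_pages() is expected to arrange its teardown after this change. It is modeled on the drivers/dax/pmem.c hunks above; the my_* names and the simplified probe function are illustrative, not part of the patch.

struct my_pmem {
        struct percpu_ref ref;
        struct completion cmp;
};

static void my_percpu_release(struct percpu_ref *ref)
{
        struct my_pmem *pmem = container_of(ref, struct my_pmem, ref);

        complete(&pmem->cmp);           /* ref finally dropped to zero */
}

static void my_percpu_exit(void *data)
{
        struct percpu_ref *ref = data;
        struct my_pmem *pmem = container_of(ref, struct my_pmem, ref);

        /*
         * Runs last: devm_memremap_pages_release() has already put every
         * page, so the kill started below can now complete.
         */
        wait_for_completion(&pmem->cmp);
        percpu_ref_exit(ref);
}

static void my_percpu_kill(void *data)
{
        /* Runs first: forbid new references, but do not wait here. */
        percpu_ref_kill(data);
}

static int my_probe(struct device *dev, struct my_pmem *pmem,
                    struct resource *res)
{
        void *addr;

        init_completion(&pmem->cmp);
        if (percpu_ref_init(&pmem->ref, my_percpu_release, 0, GFP_KERNEL))
                return -ENOMEM;
        /*
         * devm actions run in reverse order of registration, so the
         * teardown sequence becomes: kill -> memremap release -> exit.
         */
        if (devm_add_action_or_reset(dev, my_percpu_exit, &pmem->ref))
                return -ENOMEM;
        addr = devm_memremap_pages(dev, res, &pmem->ref, NULL);
        if (IS_ERR(addr))
                return PTR_ERR(addr);
        return devm_add_action_or_reset(dev, my_percpu_kill, &pmem->ref);
}

The point is that percpu_ref_kill() runs first, devm_memremap_pages_release() then drops the per-page references taken at map time, and only after that does the exit action wait for the completion and call percpu_ref_exit().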
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-28 17:23 ` [PATCH v2] mm, zone_device: replace " Dan Williams @ 2017-04-28 17:34 ` Jerome Glisse 2017-04-28 17:41 ` Dan Williams 2017-04-29 14:18 ` Ingo Molnar 1 sibling, 1 reply; 43+ messages in thread From: Jerome Glisse @ 2017-04-28 17:34 UTC (permalink / raw) To: Dan Williams Cc: mingo, linux-kernel, linux-mm, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov > Kirill points out that the calls to {get,put}_dev_pagemap() can be > removed from the mm fast path if we take a single get_dev_pagemap() > reference to signify that the page is alive and use the final put of the > page to drop that reference. > > This does require some care to make sure that any waits for the > percpu_ref to drop to zero occur *after* devm_memremap_page_release(), > since it now maintains its own elevated reference. This is NAK from HMM point of view as i need those call. So if you remove them now i will need to add them back as part of HMM. Cheers, Jérôme > > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Jérôme Glisse <jglisse@redhat.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Reviewed-by: Logan Gunthorpe <logang@deltatee.com> > Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> > Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > Changes in v2: > * Rebased to tip/master > * Clarified comment in __put_page > * Clarified devm_memremap_pages() kernel doc about ordering of > devm_memremap_pages_release() vs percpu_ref_kill() vs wait for > percpu_ref to drop to zero. > > Ingo, I retested this with a revert of commit 6dd29b3df975 "Revert > 'x86/mm/gup: Switch GUP to the generic get_user_page_fast() > implementation'". It should be good to go through x86/mm. 
> > drivers/dax/pmem.c | 2 +- > drivers/nvdimm/pmem.c | 13 +++++++++++-- > include/linux/mm.h | 14 -------------- > kernel/memremap.c | 22 +++++++++------------- > mm/swap.c | 10 ++++++++++ > 5 files changed, 31 insertions(+), 30 deletions(-) > > diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c > index 033f49b31fdc..cb0d742fa23f 100644 > --- a/drivers/dax/pmem.c > +++ b/drivers/dax/pmem.c > @@ -43,6 +43,7 @@ static void dax_pmem_percpu_exit(void *data) > struct dax_pmem *dax_pmem = to_dax_pmem(ref); > > dev_dbg(dax_pmem->dev, "%s\n", __func__); > + wait_for_completion(&dax_pmem->cmp); > percpu_ref_exit(ref); > } > > @@ -53,7 +54,6 @@ static void dax_pmem_percpu_kill(void *data) > > dev_dbg(dax_pmem->dev, "%s\n", __func__); > percpu_ref_kill(ref); > - wait_for_completion(&dax_pmem->cmp); > } > > static int dax_pmem_probe(struct device *dev) > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 5b536be5a12e..fb7bbc79ac26 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -25,6 +25,7 @@ > #include <linux/badblocks.h> > #include <linux/memremap.h> > #include <linux/vmalloc.h> > +#include <linux/blk-mq.h> > #include <linux/pfn_t.h> > #include <linux/slab.h> > #include <linux/pmem.h> > @@ -231,6 +232,11 @@ static void pmem_release_queue(void *q) > blk_cleanup_queue(q); > } > > +static void pmem_freeze_queue(void *q) > +{ > + blk_mq_freeze_queue_start(q); > +} > + > static void pmem_release_disk(void *disk) > { > del_gendisk(disk); > @@ -284,6 +290,9 @@ static int pmem_attach_disk(struct device *dev, > if (!q) > return -ENOMEM; > > + if (devm_add_action_or_reset(dev, pmem_release_queue, q)) > + return -ENOMEM; > + > pmem->pfn_flags = PFN_DEV; > if (is_nd_pfn(dev)) { > addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter, > @@ -303,10 +312,10 @@ static int pmem_attach_disk(struct device *dev, > pmem->size, ARCH_MEMREMAP_PMEM); > > /* > - * At release time the queue must be dead before > + * At release time the queue must be frozen before > * devm_memremap_pages is unwound > */ > - if (devm_add_action_or_reset(dev, pmem_release_queue, q)) > + if (devm_add_action_or_reset(dev, pmem_freeze_queue, q)) > return -ENOMEM; > > if (IS_ERR(addr)) > diff --git a/include/linux/mm.h b/include/linux/mm.h > index a835edd2db34..695da2a19b4c 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -762,19 +762,11 @@ static inline enum zone_type page_zonenum(const struct > page *page) > } > > #ifdef CONFIG_ZONE_DEVICE > -void get_zone_device_page(struct page *page); > -void put_zone_device_page(struct page *page); > static inline bool is_zone_device_page(const struct page *page) > { > return page_zonenum(page) == ZONE_DEVICE; > } > #else > -static inline void get_zone_device_page(struct page *page) > -{ > -} > -static inline void put_zone_device_page(struct page *page) > -{ > -} > static inline bool is_zone_device_page(const struct page *page) > { > return false; > @@ -790,9 +782,6 @@ static inline void get_page(struct page *page) > */ > VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page); > page_ref_inc(page); > - > - if (unlikely(is_zone_device_page(page))) > - get_zone_device_page(page); > } > > static inline void put_page(struct page *page) > @@ -801,9 +790,6 @@ static inline void put_page(struct page *page) > > if (put_page_testzero(page)) > __put_page(page); > - > - if (unlikely(is_zone_device_page(page))) > - put_zone_device_page(page); > } > > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > diff --git a/kernel/memremap.c 
b/kernel/memremap.c > index 07e85e5229da..23a6483c3666 100644 > --- a/kernel/memremap.c > +++ b/kernel/memremap.c > @@ -182,18 +182,6 @@ struct page_map { > struct vmem_altmap altmap; > }; > > -void get_zone_device_page(struct page *page) > -{ > - percpu_ref_get(page->pgmap->ref); > -} > -EXPORT_SYMBOL(get_zone_device_page); > - > -void put_zone_device_page(struct page *page) > -{ > - put_dev_pagemap(page->pgmap); > -} > -EXPORT_SYMBOL(put_zone_device_page); > - > static void pgmap_radix_release(struct resource *res) > { > resource_size_t key, align_start, align_size, align_end; > @@ -237,6 +225,10 @@ static void devm_memremap_pages_release(struct device > *dev, void *data) > struct resource *res = &page_map->res; > resource_size_t align_start, align_size; > struct dev_pagemap *pgmap = &page_map->pgmap; > + unsigned long pfn; > + > + for_each_device_pfn(pfn, page_map) > + put_page(pfn_to_page(pfn)); > > if (percpu_ref_tryget_live(pgmap->ref)) { > dev_WARN(dev, "%s: page mapping is still live!\n", __func__); > @@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t > phys) > * > * Notes: > * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() > time > - * (or devm release event). > + * (or devm release event). The expected order of events is that @ref has > + * been through percpu_ref_kill() before devm_memremap_pages_release(). > The > + * wait for the completion of all references being dropped and > + * percpu_ref_exit() must occur after devm_memremap_pages_release(). > * > * 2/ @res is expected to be a host memory range that could feasibly be > * treated as a "System RAM" range, i.e. not a device mmio range, but > @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct > resource *res, > */ > list_del(&page->lru); > page->pgmap = pgmap; > + percpu_ref_get(ref); > } > devres_add(dev, page_map); > return __va(res->start); > diff --git a/mm/swap.c b/mm/swap.c > index 5dabf444d724..d8d9ee9e311a 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page) > > void __put_page(struct page *page) > { > + if (is_zone_device_page(page)) { > + put_dev_pagemap(page->pgmap); > + > + /* > + * The page belongs to the device that created pgmap. Do > + * not return it to page allocator. > + */ > + return; > + } > + > if (unlikely(PageCompound(page))) > __put_compound_page(page); > else > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-28 17:34 ` Jerome Glisse @ 2017-04-28 17:41 ` Dan Williams 2017-04-28 18:00 ` Jerome Glisse 0 siblings, 1 reply; 43+ messages in thread From: Dan Williams @ 2017-04-28 17:41 UTC (permalink / raw) To: Jerome Glisse Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com> wrote: >> Kirill points out that the calls to {get,put}_dev_pagemap() can be >> removed from the mm fast path if we take a single get_dev_pagemap() >> reference to signify that the page is alive and use the final put of the >> page to drop that reference. >> >> This does require some care to make sure that any waits for the >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(), >> since it now maintains its own elevated reference. > > This is NAK from HMM point of view as i need those call. So if you remove > them now i will need to add them back as part of HMM. I thought you only need them at page free time? You can still hook __put_page(). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-28 17:41 ` Dan Williams @ 2017-04-28 18:00 ` Jerome Glisse 2017-04-28 19:02 ` Dan Williams 0 siblings, 1 reply; 43+ messages in thread From: Jerome Glisse @ 2017-04-28 18:00 UTC (permalink / raw) To: Dan Williams Cc: Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov > On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com> wrote: > >> Kirill points out that the calls to {get,put}_dev_pagemap() can be > >> removed from the mm fast path if we take a single get_dev_pagemap() > >> reference to signify that the page is alive and use the final put of the > >> page to drop that reference. > >> > >> This does require some care to make sure that any waits for the > >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(), > >> since it now maintains its own elevated reference. > > > > This is NAK from HMM point of view as i need those call. So if you remove > > them now i will need to add them back as part of HMM. > > I thought you only need them at page free time? You can still hook > __put_page(). No, I need a hook when the page refcount reaches 1, not 0. That being said, I don't care about put_dev_pagemap(page->pgmap); so that part of the patch is fine from the HMM point of view, but I definitely need to hook myself into the general put_page() function. So I will have to undo part of this patch for HMM (put_page() will need to handle ZONE_DEVICE pages differently). Cheers, Jérôme -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-28 18:00 ` Jerome Glisse @ 2017-04-28 19:02 ` Dan Williams 2017-04-28 19:16 ` Jerome Glisse 0 siblings, 1 reply; 43+ messages in thread From: Dan Williams @ 2017-04-28 19:02 UTC (permalink / raw) To: Jerome Glisse Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse <jglisse@redhat.com> wrote: >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com> wrote: >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be >> >> removed from the mm fast path if we take a single get_dev_pagemap() >> >> reference to signify that the page is alive and use the final put of the >> >> page to drop that reference. >> >> >> >> This does require some care to make sure that any waits for the >> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(), >> >> since it now maintains its own elevated reference. >> > >> > This is NAK from HMM point of view as i need those call. So if you remove >> > them now i will need to add them back as part of HMM. >> >> I thought you only need them at page free time? You can still hook >> __put_page(). > > No, i need a hook when page refcount reach 1, not 0. That being said > i don't care about put_dev_pagemap(page->pgmap); so that part of the > patch is fine from HMM point of view but i definitly need to hook my- > self in the general put_page() function. > > So i will have to undo part of this patch for HMM (put_page() will > need to handle ZONE_DEVICE page differently). Ok, I'd rather this go in now since it fixes the existing use case, and unblocks the get_user_pages_fast() conversion to generic code. That also gives Kirill and -mm folks a chance to review what HMM wants to do on top of the page_ref infrastructure. The {get,put}_zone_device_page interface went in in 4.5 right before page_ref went in during 4.6, so it was just an oversight that {get,put}_zone_device_page were not removed earlier. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-28 19:02 ` Dan Williams @ 2017-04-28 19:16 ` Jerome Glisse 2017-04-28 19:22 ` Dan Williams 0 siblings, 1 reply; 43+ messages in thread From: Jerome Glisse @ 2017-04-28 19:16 UTC (permalink / raw) To: Dan Williams Cc: Ingo Molnar, linux-kernel, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov > On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse <jglisse@redhat.com> wrote: > >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com> > >> wrote: > >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be > >> >> removed from the mm fast path if we take a single get_dev_pagemap() > >> >> reference to signify that the page is alive and use the final put of > >> >> the > >> >> page to drop that reference. > >> >> > >> >> This does require some care to make sure that any waits for the > >> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(), > >> >> since it now maintains its own elevated reference. > >> > > >> > This is NAK from HMM point of view as i need those call. So if you > >> > remove > >> > them now i will need to add them back as part of HMM. > >> > >> I thought you only need them at page free time? You can still hook > >> __put_page(). > > > > No, i need a hook when page refcount reach 1, not 0. That being said > > i don't care about put_dev_pagemap(page->pgmap); so that part of the > > patch is fine from HMM point of view but i definitly need to hook my- > > self in the general put_page() function. > > > > So i will have to undo part of this patch for HMM (put_page() will > > need to handle ZONE_DEVICE page differently). > > Ok, I'd rather this go in now since it fixes the existing use case, > and unblocks the get_user_pages_fast() conversion to generic code. > That also gives Kirill and -mm folks a chance to review what HMM wants > to do on top of the page_ref infrastructure. The > {get,put}_zone_device_page interface went in in 4.5 right before > page_ref went in during 4.6, so it was just an oversight that > {get,put}_zone_device_page were not removed earlier. > I don't mind this going in, i am hopping people won't ignore HMM patchset once i repost after 4.12 merge window. Note that there is absolutely no way around me hooking up inside put_page(). The only other way to do it would be to modify virtualy all places that call that function to handle ZONE_DEVICE case. Jérôme -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-28 19:16 ` Jerome Glisse @ 2017-04-28 19:22 ` Dan Williams 2017-04-28 19:33 ` Jerome Glisse 0 siblings, 1 reply; 43+ messages in thread From: Dan Williams @ 2017-04-28 19:22 UTC (permalink / raw) To: Jerome Glisse Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Fri, Apr 28, 2017 at 12:16 PM, Jerome Glisse <jglisse@redhat.com> wrote: >> On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse <jglisse@redhat.com> wrote: >> >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com> >> >> wrote: >> >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be >> >> >> removed from the mm fast path if we take a single get_dev_pagemap() >> >> >> reference to signify that the page is alive and use the final put of >> >> >> the >> >> >> page to drop that reference. >> >> >> >> >> >> This does require some care to make sure that any waits for the >> >> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(), >> >> >> since it now maintains its own elevated reference. >> >> > >> >> > This is NAK from HMM point of view as i need those call. So if you >> >> > remove >> >> > them now i will need to add them back as part of HMM. >> >> >> >> I thought you only need them at page free time? You can still hook >> >> __put_page(). >> > >> > No, i need a hook when page refcount reach 1, not 0. That being said >> > i don't care about put_dev_pagemap(page->pgmap); so that part of the >> > patch is fine from HMM point of view but i definitly need to hook my- >> > self in the general put_page() function. >> > >> > So i will have to undo part of this patch for HMM (put_page() will >> > need to handle ZONE_DEVICE page differently). >> >> Ok, I'd rather this go in now since it fixes the existing use case, >> and unblocks the get_user_pages_fast() conversion to generic code. >> That also gives Kirill and -mm folks a chance to review what HMM wants >> to do on top of the page_ref infrastructure. The >> {get,put}_zone_device_page interface went in in 4.5 right before >> page_ref went in during 4.6, so it was just an oversight that >> {get,put}_zone_device_page were not removed earlier. >> > > I don't mind this going in, i am hopping people won't ignore HMM patchset > once i repost after 4.12 merge window. Note that there is absolutely no way > around me hooking up inside put_page(). The only other way to do it would > be to modify virtualy all places that call that function to handle ZONE_DEVICE > case. Are you sure about needing to hook the 2 -> 1 transition? Could we change ZONE_DEVICE pages to not have an elevated reference count when they are created so you can keep the HMM references out of the mm hot path? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-28 19:22 ` Dan Williams @ 2017-04-28 19:33 ` Jerome Glisse 2017-04-29 10:17 ` Kirill A. Shutemov 0 siblings, 1 reply; 43+ messages in thread From: Jerome Glisse @ 2017-04-28 19:33 UTC (permalink / raw) To: Dan Williams Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: > On Fri, Apr 28, 2017 at 12:16 PM, Jerome Glisse <jglisse@redhat.com> wrote: > >> On Fri, Apr 28, 2017 at 11:00 AM, Jerome Glisse <jglisse@redhat.com> wrote: > >> >> On Fri, Apr 28, 2017 at 10:34 AM, Jerome Glisse <jglisse@redhat.com> > >> >> wrote: > >> >> >> Kirill points out that the calls to {get,put}_dev_pagemap() can be > >> >> >> removed from the mm fast path if we take a single get_dev_pagemap() > >> >> >> reference to signify that the page is alive and use the final put of > >> >> >> the > >> >> >> page to drop that reference. > >> >> >> > >> >> >> This does require some care to make sure that any waits for the > >> >> >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(), > >> >> >> since it now maintains its own elevated reference. > >> >> > > >> >> > This is NAK from HMM point of view as i need those call. So if you > >> >> > remove > >> >> > them now i will need to add them back as part of HMM. > >> >> > >> >> I thought you only need them at page free time? You can still hook > >> >> __put_page(). > >> > > >> > No, i need a hook when page refcount reach 1, not 0. That being said > >> > i don't care about put_dev_pagemap(page->pgmap); so that part of the > >> > patch is fine from HMM point of view but i definitly need to hook my- > >> > self in the general put_page() function. > >> > > >> > So i will have to undo part of this patch for HMM (put_page() will > >> > need to handle ZONE_DEVICE page differently). > >> > >> Ok, I'd rather this go in now since it fixes the existing use case, > >> and unblocks the get_user_pages_fast() conversion to generic code. > >> That also gives Kirill and -mm folks a chance to review what HMM wants > >> to do on top of the page_ref infrastructure. The > >> {get,put}_zone_device_page interface went in in 4.5 right before > >> page_ref went in during 4.6, so it was just an oversight that > >> {get,put}_zone_device_page were not removed earlier. > >> > > > > I don't mind this going in, i am hopping people won't ignore HMM patchset > > once i repost after 4.12 merge window. Note that there is absolutely no way > > around me hooking up inside put_page(). The only other way to do it would > > be to modify virtualy all places that call that function to handle ZONE_DEVICE > > case. > > Are you sure about needing to hook the 2 -> 1 transition? Could we > change ZONE_DEVICE pages to not have an elevated reference count when > they are created so you can keep the HMM references out of the mm hot > path? 100% sure on that :) I need to callback into driver for 2->1 transition no way around that. If we change ZONE_DEVICE to not have an elevated reference count that you need to make a lot more change to mm so that ZONE_DEVICE is never use as fallback for memory allocation. Also need to make change to be sure that ZONE_DEVICE page never endup in one of the path that try to put them back on lru. There is a lot of place that would need to be updated and it would be highly intrusive and add a lot of special cases to other hot code path. 
Maybe I overestimate the amount of work, but off the top of my head it is far from trivial. Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-28 19:33 ` Jerome Glisse @ 2017-04-29 10:17 ` Kirill A. Shutemov 2017-04-30 23:14 ` Jerome Glisse 0 siblings, 1 reply; 43+ messages in thread From: Kirill A. Shutemov @ 2017-04-29 10:17 UTC (permalink / raw) To: Jerome Glisse Cc: Dan Williams, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote: > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: > > Are you sure about needing to hook the 2 -> 1 transition? Could we > > change ZONE_DEVICE pages to not have an elevated reference count when > > they are created so you can keep the HMM references out of the mm hot > > path? > > 100% sure on that :) I need to callback into driver for 2->1 transition > no way around that. If we change ZONE_DEVICE to not have an elevated > reference count that you need to make a lot more change to mm so that > ZONE_DEVICE is never use as fallback for memory allocation. Also need > to make change to be sure that ZONE_DEVICE page never endup in one of > the path that try to put them back on lru. There is a lot of place that > would need to be updated and it would be highly intrusive and add a > lot of special cases to other hot code path. Could you explain more on where the requirement comes from or point me to where I can read about this. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-29 10:17 ` Kirill A. Shutemov @ 2017-04-30 23:14 ` Jerome Glisse 2017-05-01 1:42 ` Dan Williams ` (2 more replies) 0 siblings, 3 replies; 43+ messages in thread From: Jerome Glisse @ 2017-04-30 23:14 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Dan Williams, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote: > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote: > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: > > > Are you sure about needing to hook the 2 -> 1 transition? Could we > > > change ZONE_DEVICE pages to not have an elevated reference count when > > > they are created so you can keep the HMM references out of the mm hot > > > path? > > > > 100% sure on that :) I need to callback into driver for 2->1 transition > > no way around that. If we change ZONE_DEVICE to not have an elevated > > reference count that you need to make a lot more change to mm so that > > ZONE_DEVICE is never use as fallback for memory allocation. Also need > > to make change to be sure that ZONE_DEVICE page never endup in one of > > the path that try to put them back on lru. There is a lot of place that > > would need to be updated and it would be highly intrusive and add a > > lot of special cases to other hot code path. > > Could you explain more on where the requirement comes from or point me to > where I can read about this. > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page) in _any_ vma. So i need to know when a page is freed ie either as result of unmap, exit or migration or anything that would free the memory. For zone device a page is free once its refcount reach 1 so i need to catch refcount transition from 2->1 This is the only way i can inform the device that the page is now free. See https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d There is _no_ way around that. Cheers, Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
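To make the requirement concrete, the hook being described is roughly of this shape. This is only a sketch of the idea; the linked hmm-v21 commit is the authoritative version, and is_device_private_page()/device_private_page_free() are placeholder names.

static inline void put_page(struct page *page)
{
        page = compound_head(page);

        if (unlikely(is_device_private_page(page))) {
                /*
                 * For these pages a refcount of 1 means "free": notify the
                 * driver on the 2 -> 1 transition instead of freeing the
                 * page when the count reaches zero.
                 */
                if (page_ref_dec_return(page) == 1)
                        device_private_page_free(page);
                return;
        }

        if (put_page_testzero(page))
                __put_page(page);
}

The count never drops to zero for such a page while the pagemap is alive; dropping back to 1 is what tells the driver that the device memory backing the page can be reused.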
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-30 23:14 ` Jerome Glisse @ 2017-05-01 1:42 ` Dan Williams 2017-05-01 1:54 ` Jerome Glisse 2017-05-01 3:48 ` Logan Gunthorpe 2017-05-01 10:23 ` Kirill A. Shutemov 2 siblings, 1 reply; 43+ messages in thread From: Dan Williams @ 2017-05-01 1:42 UTC (permalink / raw) To: Jerome Glisse Cc: Kirill A. Shutemov, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse <jglisse@redhat.com> wrote: > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote: >> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote: >> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: >> > > Are you sure about needing to hook the 2 -> 1 transition? Could we >> > > change ZONE_DEVICE pages to not have an elevated reference count when >> > > they are created so you can keep the HMM references out of the mm hot >> > > path? >> > >> > 100% sure on that :) I need to callback into driver for 2->1 transition >> > no way around that. If we change ZONE_DEVICE to not have an elevated >> > reference count that you need to make a lot more change to mm so that >> > ZONE_DEVICE is never use as fallback for memory allocation. Also need >> > to make change to be sure that ZONE_DEVICE page never endup in one of >> > the path that try to put them back on lru. There is a lot of place that >> > would need to be updated and it would be highly intrusive and add a >> > lot of special cases to other hot code path. >> >> Could you explain more on where the requirement comes from or point me to >> where I can read about this. >> > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page) > in _any_ vma. So i need to know when a page is freed ie either as result of > unmap, exit or migration or anything that would free the memory. For zone > device a page is free once its refcount reach 1 so i need to catch refcount > transition from 2->1 > > This is the only way i can inform the device that the page is now free. See > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d > > There is _no_ way around that. Ok, but I need to point out that this not a ZONE_DEVICE requirement. This is an HMM-specific need. So, this extra reference counting should be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case. Can we hide the extra reference counting behind a static branch so that the common case fast path doesn't get slower until a HMM device shows up? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-05-01 1:42 ` Dan Williams @ 2017-05-01 1:54 ` Jerome Glisse 2017-05-01 2:40 ` Dan Williams 0 siblings, 1 reply; 43+ messages in thread From: Jerome Glisse @ 2017-05-01 1:54 UTC (permalink / raw) To: Dan Williams Cc: Kirill A. Shutemov, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Sun, Apr 30, 2017 at 06:42:02PM -0700, Dan Williams wrote: > On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse <jglisse@redhat.com> wrote: > > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote: > >> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote: > >> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: > >> > > Are you sure about needing to hook the 2 -> 1 transition? Could we > >> > > change ZONE_DEVICE pages to not have an elevated reference count when > >> > > they are created so you can keep the HMM references out of the mm hot > >> > > path? > >> > > >> > 100% sure on that :) I need to callback into driver for 2->1 transition > >> > no way around that. If we change ZONE_DEVICE to not have an elevated > >> > reference count that you need to make a lot more change to mm so that > >> > ZONE_DEVICE is never use as fallback for memory allocation. Also need > >> > to make change to be sure that ZONE_DEVICE page never endup in one of > >> > the path that try to put them back on lru. There is a lot of place that > >> > would need to be updated and it would be highly intrusive and add a > >> > lot of special cases to other hot code path. > >> > >> Could you explain more on where the requirement comes from or point me to > >> where I can read about this. > >> > > > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page) > > in _any_ vma. So i need to know when a page is freed ie either as result of > > unmap, exit or migration or anything that would free the memory. For zone > > device a page is free once its refcount reach 1 so i need to catch refcount > > transition from 2->1 > > > > This is the only way i can inform the device that the page is now free. See > > > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d > > > > There is _no_ way around that. > > Ok, but I need to point out that this not a ZONE_DEVICE requirement. > This is an HMM-specific need. So, this extra reference counting should > be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case. And it already is delimited, i think you even gave your review by on the patch. > Can we hide the extra reference counting behind a static branch so > that the common case fast path doesn't get slower until a HMM device > shows up? Like i already did With likely()/unlikely() ? Or something else ? https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=e84778e9db0672e371eb6599dfcb812512118842 Cheers, Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-05-01 1:54 ` Jerome Glisse @ 2017-05-01 2:40 ` Dan Williams 0 siblings, 0 replies; 43+ messages in thread From: Dan Williams @ 2017-05-01 2:40 UTC (permalink / raw) To: Jerome Glisse Cc: Kirill A. Shutemov, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Sun, Apr 30, 2017 at 6:54 PM, Jerome Glisse <jglisse@redhat.com> wrote: > On Sun, Apr 30, 2017 at 06:42:02PM -0700, Dan Williams wrote: >> On Sun, Apr 30, 2017 at 4:14 PM, Jerome Glisse <jglisse@redhat.com> wrote: >> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote: >> >> On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote: >> >> > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: >> >> > > Are you sure about needing to hook the 2 -> 1 transition? Could we >> >> > > change ZONE_DEVICE pages to not have an elevated reference count when >> >> > > they are created so you can keep the HMM references out of the mm hot >> >> > > path? >> >> > >> >> > 100% sure on that :) I need to callback into driver for 2->1 transition >> >> > no way around that. If we change ZONE_DEVICE to not have an elevated >> >> > reference count that you need to make a lot more change to mm so that >> >> > ZONE_DEVICE is never use as fallback for memory allocation. Also need >> >> > to make change to be sure that ZONE_DEVICE page never endup in one of >> >> > the path that try to put them back on lru. There is a lot of place that >> >> > would need to be updated and it would be highly intrusive and add a >> >> > lot of special cases to other hot code path. >> >> >> >> Could you explain more on where the requirement comes from or point me to >> >> where I can read about this. >> >> >> > >> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page) >> > in _any_ vma. So i need to know when a page is freed ie either as result of >> > unmap, exit or migration or anything that would free the memory. For zone >> > device a page is free once its refcount reach 1 so i need to catch refcount >> > transition from 2->1 >> > >> > This is the only way i can inform the device that the page is now free. See >> > >> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d >> > >> > There is _no_ way around that. >> >> Ok, but I need to point out that this not a ZONE_DEVICE requirement. >> This is an HMM-specific need. So, this extra reference counting should >> be clearly delineated as part of the MEMORY_DEVICE_PRIVATE use case. > > And it already is delimited, i think you even gave your review by on > the patch. > I gave my reviewed-by to the patch that leveraged {get,put}_zone_device_page(), but now those are gone and HMM needs a new patch. I'm just saying these reference counts should not come back as {get,put}_zone_device_page() because now they're HMM specific. >> Can we hide the extra reference counting behind a static branch so >> that the common case fast path doesn't get slower until a HMM device >> shows up? > > Like i already did With likely()/unlikely() ? Or something else ? > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=e84778e9db0672e371eb6599dfcb812512118842 Something different: http://lxr.free-electrons.com/source/Documentation/static-keys.txt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
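A minimal sketch of the static-key variant Dan is pointing at (illustrative only; the key name and put_device_private_page() are placeholders, not code from the thread), so the extra test costs nothing until the first HMM device registers memory:

#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(device_private_key);

static inline void put_page(struct page *page)
{
        page = compound_head(page);

        /* Patched out at runtime until static_branch_enable() is called. */
        if (static_branch_unlikely(&device_private_key) &&
            unlikely(is_device_private_page(page))) {
                put_device_private_page(page); /* hypothetical 2->1 hook */
                return;
        }

        if (put_page_testzero(page))
                __put_page(page);
}

An HMM driver would flip the key when it registers device memory, e.g. with static_branch_enable(&device_private_key), keeping the common fast path unchanged on systems without such a device.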
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-30 23:14 ` Jerome Glisse 2017-05-01 1:42 ` Dan Williams @ 2017-05-01 3:48 ` Logan Gunthorpe 2017-05-01 10:23 ` Kirill A. Shutemov 2 siblings, 0 replies; 43+ messages in thread From: Logan Gunthorpe @ 2017-05-01 3:48 UTC (permalink / raw) To: Jerome Glisse, Kirill A. Shutemov Cc: Dan Williams, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Kirill Shutemov On 30/04/17 05:14 PM, Jerome Glisse wrote: > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page) > in _any_ vma. So i need to know when a page is freed ie either as result of > unmap, exit or migration or anything that would free the memory. For zone > device a page is free once its refcount reach 1 so i need to catch refcount > transition from 2->1 > > This is the only way i can inform the device that the page is now free. See > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d > > There is _no_ way around that. I had a similar issue in a piece of my p2pmem RFC [1]. I hacked around it by tracking the pages separately and freeing them when the vma is closed. This is by no means a great solution, it certainly has it's own warts. However, maybe it will spark some ideas for some alternate choices which avoid the hot path. Another thing I briefly looked at was hooking the vma close process earlier so that it would callback in time that you can loop through the pages and do your free process. Of course this all depends on the vma not getting closed while the pages have other references. So it may not work at all. Again, just ideas. Logan [1] https://github.com/sbates130272/linux-p2pmem/commit/77c631d92cb5c451c9824b3a4cf9b6cddfde6bb7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
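Roughly what that workaround amounts to (a sketch, not Logan's actual code; see the linked p2pmem commit): the pages handed out for a mapping are tracked on the side and released from the vma's close hook, which only works as long as the vma is not torn down while other references to the pages are outstanding.

struct my_mapping {
        struct page **pages;
        unsigned long nr_pages;
};

static void my_vma_close(struct vm_area_struct *vma)
{
        struct my_mapping *m = vma->vm_private_data;
        unsigned long i;

        /*
         * Give the pages back to the device allocator when the mapping
         * goes away, instead of hooking the per-page refcount.
         */
        for (i = 0; i < m->nr_pages; i++)
                my_free_device_page(m->pages[i]);       /* hypothetical */
        kfree(m->pages);
        kfree(m);
}

static const struct vm_operations_struct my_vm_ops = {
        .close  = my_vma_close,
};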
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-30 23:14 ` Jerome Glisse 2017-05-01 1:42 ` Dan Williams 2017-05-01 3:48 ` Logan Gunthorpe @ 2017-05-01 10:23 ` Kirill A. Shutemov 2017-05-01 13:55 ` Jerome Glisse 2 siblings, 1 reply; 43+ messages in thread From: Kirill A. Shutemov @ 2017-05-01 10:23 UTC (permalink / raw) To: Jerome Glisse Cc: Dan Williams, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote: > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote: > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote: > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we > > > > change ZONE_DEVICE pages to not have an elevated reference count when > > > > they are created so you can keep the HMM references out of the mm hot > > > > path? > > > > > > 100% sure on that :) I need to callback into driver for 2->1 transition > > > no way around that. If we change ZONE_DEVICE to not have an elevated > > > reference count that you need to make a lot more change to mm so that > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need > > > to make change to be sure that ZONE_DEVICE page never endup in one of > > > the path that try to put them back on lru. There is a lot of place that > > > would need to be updated and it would be highly intrusive and add a > > > lot of special cases to other hot code path. > > > > Could you explain more on where the requirement comes from or point me to > > where I can read about this. > > > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page) > in _any_ vma. So i need to know when a page is freed ie either as result of > unmap, exit or migration or anything that would free the memory. For zone > device a page is free once its refcount reach 1 so i need to catch refcount > transition from 2->1 What if we would rework zone device to have pages with refcount 0 at start? > This is the only way i can inform the device that the page is now free. See > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d > > There is _no_ way around that. I'm still not convinced that it's impossible. Could you describe lifecycle for pages in case of HMM? -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-05-01 10:23 ` Kirill A. Shutemov @ 2017-05-01 13:55 ` Jerome Glisse 2017-05-01 20:19 ` Dan Williams 2017-05-02 11:37 ` Kirill A. Shutemov 0 siblings, 2 replies; 43+ messages in thread From: Jerome Glisse @ 2017-05-01 13:55 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Dan Williams, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote: > On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote: > > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote: > > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote: > > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: > > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we > > > > > change ZONE_DEVICE pages to not have an elevated reference count when > > > > > they are created so you can keep the HMM references out of the mm hot > > > > > path? > > > > > > > > 100% sure on that :) I need to callback into driver for 2->1 transition > > > > no way around that. If we change ZONE_DEVICE to not have an elevated > > > > reference count that you need to make a lot more change to mm so that > > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need > > > > to make change to be sure that ZONE_DEVICE page never endup in one of > > > > the path that try to put them back on lru. There is a lot of place that > > > > would need to be updated and it would be highly intrusive and add a > > > > lot of special cases to other hot code path. > > > > > > Could you explain more on where the requirement comes from or point me to > > > where I can read about this. > > > > > > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page) > > in _any_ vma. So i need to know when a page is freed ie either as result of > > unmap, exit or migration or anything that would free the memory. For zone > > device a page is free once its refcount reach 1 so i need to catch refcount > > transition from 2->1 > > What if we would rework zone device to have pages with refcount 0 at > start? That is a _lot_ of work from top of my head because it would need changes to a lot of places and likely more hot code path that simply adding some- thing to put_page() note that i only need something in put_page() i do not need anything in the get page path. Is adding a conditional branch for HMM pages in put_page() that much of a problem ? > > This is the only way i can inform the device that the page is now free. See > > > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d > > > > There is _no_ way around that. > > I'm still not convinced that it's impossible. > > Could you describe lifecycle for pages in case of HMM? Process malloc something, end it over to some function in the program that use the GPU that function call GPU API (OpenCL, CUDA, ...) that trigger a migration to device memory. So in the kernel you get a migration like any existing migration, original page is unmap, if refcount is all ok (no pin) then a device page is allocated and thing are migrated to device memory. What happen after is unknown. Either userspace/kernel driver decide to migrate back to system memory, either there is an munmap, either there is a CPU page fault, ... 
So from that point on the device page as the exact same life as a regular page. Above i describe the migrate case, but you can also have new memory allocation that directly allocate device memory. For instance if the GPU do a page fault on an address that isn't back by anything then we can directly allocate a device page. No migration involve in that case. HMM pages are like any other pages in most respect. Exception are: - no GUP - no KSM - no lru reclaim - no NUMA balancing - no regular migration (existing migrate_page) The fact that minimum refcount for ZONE_DEVICE is 1 already gives us for free most of the above exception. To convert the refcount to be like other pages would mean that all of the above would need to be audited and probably modify to ignore ZONE_DEVICE pages (i am pretty sure Dan do not want any of the above either). Cheers, Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
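For illustration, the hook Jerome is describing amounts to a device-page special case in put_page() that fires on the 2->1 refcount transition. A minimal sketch, assuming a hypothetical page_free() callback in struct dev_pagemap (that callback and the put_devmap_page() helper are illustrative names, not existing kernel API at this point):

    /* Sketch only: catch the 2->1 refcount transition for device pages. */
    static inline void put_devmap_page(struct page *page)
    {
            /*
             * ZONE_DEVICE pages are never handed back to the page allocator;
             * refcount 1 means "free" from the driver's point of view, so
             * the 2->1 transition is the moment to notify the driver.
             */
            if (page_ref_dec_return(page) == 1)
                    page->pgmap->page_free(page);   /* hypothetical callback */
    }

    static inline void put_page(struct page *page)
    {
            page = compound_head(page);

            if (is_zone_device_page(page))
                    put_devmap_page(page);
            else if (put_page_testzero(page))
                    __put_page(page);
    }

This conditional branch in the inlined put_page() is exactly the cost being debated in the rest of the thread.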
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-05-01 13:55 ` Jerome Glisse @ 2017-05-01 20:19 ` Dan Williams 2017-05-01 20:32 ` Jerome Glisse 2017-05-02 11:37 ` Kirill A. Shutemov 1 sibling, 1 reply; 43+ messages in thread From: Dan Williams @ 2017-05-01 20:19 UTC (permalink / raw) To: Jerome Glisse Cc: Kirill A. Shutemov, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Mon, May 1, 2017 at 6:55 AM, Jerome Glisse <jglisse@redhat.com> wrote: > On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote: >> On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote: >> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote: >> > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote: >> > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: >> > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we >> > > > > change ZONE_DEVICE pages to not have an elevated reference count when >> > > > > they are created so you can keep the HMM references out of the mm hot >> > > > > path? >> > > > >> > > > 100% sure on that :) I need to callback into driver for 2->1 transition >> > > > no way around that. If we change ZONE_DEVICE to not have an elevated >> > > > reference count that you need to make a lot more change to mm so that >> > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need >> > > > to make change to be sure that ZONE_DEVICE page never endup in one of >> > > > the path that try to put them back on lru. There is a lot of place that >> > > > would need to be updated and it would be highly intrusive and add a >> > > > lot of special cases to other hot code path. >> > > >> > > Could you explain more on where the requirement comes from or point me to >> > > where I can read about this. >> > > >> > >> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page) >> > in _any_ vma. So i need to know when a page is freed ie either as result of >> > unmap, exit or migration or anything that would free the memory. For zone >> > device a page is free once its refcount reach 1 so i need to catch refcount >> > transition from 2->1 >> >> What if we would rework zone device to have pages with refcount 0 at >> start? > > That is a _lot_ of work from top of my head because it would need changes > to a lot of places and likely more hot code path that simply adding some- > thing to put_page() note that i only need something in put_page() i do not > need anything in the get page path. Is adding a conditional branch for > HMM pages in put_page() that much of a problem ? > > >> > This is the only way i can inform the device that the page is now free. See >> > >> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d >> > >> > There is _no_ way around that. >> >> I'm still not convinced that it's impossible. >> >> Could you describe lifecycle for pages in case of HMM? > > Process malloc something, end it over to some function in the program > that use the GPU that function call GPU API (OpenCL, CUDA, ...) that > trigger a migration to device memory. > > So in the kernel you get a migration like any existing migration, > original page is unmap, if refcount is all ok (no pin) then a device > page is allocated and thing are migrated to device memory. > > What happen after is unknown. 
Either userspace/kernel driver decide > to migrate back to system memory, either there is an munmap, either > there is a CPU page fault, ... So from that point on the device page > as the exact same life as a regular page. > > Above i describe the migrate case, but you can also have new memory > allocation that directly allocate device memory. For instance if the > GPU do a page fault on an address that isn't back by anything then > we can directly allocate a device page. No migration involve in that > case. > > HMM pages are like any other pages in most respect. Exception are: > - no GUP > - no KSM > - no lru reclaim > - no NUMA balancing > - no regular migration (existing migrate_page) > > The fact that minimum refcount for ZONE_DEVICE is 1 already gives > us for free most of the above exception. To convert the refcount to > be like other pages would mean that all of the above would need to > be audited and probably modify to ignore ZONE_DEVICE pages (i am > pretty sure Dan do not want any of the above either). Right, adding HMM references to get_page() and put_page() seems less intrusive. Given how uncommon HMM hardware is (insert grumble about no visible upstream user of this functionality) I think the 'static branch' approach helps mitigate the impact for everything else. Looking back, I should have used that mechanism for the pmem use case, but it's moot now. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
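A rough sketch of the 'static branch' idea Dan mentions, assuming a hypothetical key that the ZONE_DEVICE/HMM code would enable when device memory is first hotplugged (the key name is illustrative; put_devmap_page() refers to the driver-notification sketch above):

    DEFINE_STATIC_KEY_FALSE(devmap_pages_key);      /* hypothetical key */

    static inline void put_page(struct page *page)
    {
            page = compound_head(page);

            /*
             * While the key is disabled the branch is patched out, so
             * kernels that never register ZONE_DEVICE/HMM memory pay
             * essentially nothing for the extra check.
             */
            if (static_branch_unlikely(&devmap_pages_key) &&
                is_zone_device_page(page)) {
                    put_devmap_page(page);
                    return;
            }

            if (put_page_testzero(page))
                    __put_page(page);
    }

The device-memory hotplug path would then flip the key once, e.g. static_branch_enable(&devmap_pages_key), instead of every put_page() call site paying for the device-page check unconditionally.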
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-05-01 20:19 ` Dan Williams @ 2017-05-01 20:32 ` Jerome Glisse 0 siblings, 0 replies; 43+ messages in thread From: Jerome Glisse @ 2017-05-01 20:32 UTC (permalink / raw) To: Dan Williams Cc: Kirill A. Shutemov, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Mon, May 01, 2017 at 01:19:24PM -0700, Dan Williams wrote: > On Mon, May 1, 2017 at 6:55 AM, Jerome Glisse <jglisse@redhat.com> wrote: > > On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote: > >> On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote: > >> > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote: > >> > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote: > >> > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: > >> > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we > >> > > > > change ZONE_DEVICE pages to not have an elevated reference count when > >> > > > > they are created so you can keep the HMM references out of the mm hot > >> > > > > path? > >> > > > > >> > > > 100% sure on that :) I need to callback into driver for 2->1 transition > >> > > > no way around that. If we change ZONE_DEVICE to not have an elevated > >> > > > reference count that you need to make a lot more change to mm so that > >> > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need > >> > > > to make change to be sure that ZONE_DEVICE page never endup in one of > >> > > > the path that try to put them back on lru. There is a lot of place that > >> > > > would need to be updated and it would be highly intrusive and add a > >> > > > lot of special cases to other hot code path. > >> > > > >> > > Could you explain more on where the requirement comes from or point me to > >> > > where I can read about this. > >> > > > >> > > >> > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page) > >> > in _any_ vma. So i need to know when a page is freed ie either as result of > >> > unmap, exit or migration or anything that would free the memory. For zone > >> > device a page is free once its refcount reach 1 so i need to catch refcount > >> > transition from 2->1 > >> > >> What if we would rework zone device to have pages with refcount 0 at > >> start? > > > > That is a _lot_ of work from top of my head because it would need changes > > to a lot of places and likely more hot code path that simply adding some- > > thing to put_page() note that i only need something in put_page() i do not > > need anything in the get page path. Is adding a conditional branch for > > HMM pages in put_page() that much of a problem ? > > > > > >> > This is the only way i can inform the device that the page is now free. See > >> > > >> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d > >> > > >> > There is _no_ way around that. > >> > >> I'm still not convinced that it's impossible. > >> > >> Could you describe lifecycle for pages in case of HMM? > > > > Process malloc something, end it over to some function in the program > > that use the GPU that function call GPU API (OpenCL, CUDA, ...) that > > trigger a migration to device memory. 
> > > > So in the kernel you get a migration like any existing migration, > > original page is unmap, if refcount is all ok (no pin) then a device > > page is allocated and thing are migrated to device memory. > > > > What happen after is unknown. Either userspace/kernel driver decide > > to migrate back to system memory, either there is an munmap, either > > there is a CPU page fault, ... So from that point on the device page > > as the exact same life as a regular page. > > > > Above i describe the migrate case, but you can also have new memory > > allocation that directly allocate device memory. For instance if the > > GPU do a page fault on an address that isn't back by anything then > > we can directly allocate a device page. No migration involve in that > > case. > > > > HMM pages are like any other pages in most respect. Exception are: > > - no GUP > > - no KSM > > - no lru reclaim > > - no NUMA balancing > > - no regular migration (existing migrate_page) > > > > The fact that minimum refcount for ZONE_DEVICE is 1 already gives > > us for free most of the above exception. To convert the refcount to > > be like other pages would mean that all of the above would need to > > be audited and probably modify to ignore ZONE_DEVICE pages (i am > > pretty sure Dan do not want any of the above either). > > Right, adding HMM references to get_page() and put_page() seems less > intrusive. Given how uncommon HMM hardware is (insert grumble about no > visible upstream user of this functionality) I think the 'static > branch' approach helps mitigate the impact for everything else. > Looking back, I should have used that mechanism for the pmem use case, > but it's moot now. I do not need anything in get_page() all i need is something in put_page() to catch the 2 -> 1 refcount transition to know when a page is freed. Cheers, Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-05-01 13:55 ` Jerome Glisse 2017-05-01 20:19 ` Dan Williams @ 2017-05-02 11:37 ` Kirill A. Shutemov 2017-05-02 13:22 ` Jerome Glisse 1 sibling, 1 reply; 43+ messages in thread From: Kirill A. Shutemov @ 2017-05-02 11:37 UTC (permalink / raw) To: Jerome Glisse Cc: Dan Williams, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Mon, May 01, 2017 at 09:55:48AM -0400, Jerome Glisse wrote: > On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote: > > On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote: > > > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote: > > > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote: > > > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: > > > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we > > > > > > change ZONE_DEVICE pages to not have an elevated reference count when > > > > > > they are created so you can keep the HMM references out of the mm hot > > > > > > path? > > > > > > > > > > 100% sure on that :) I need to callback into driver for 2->1 transition > > > > > no way around that. If we change ZONE_DEVICE to not have an elevated > > > > > reference count that you need to make a lot more change to mm so that > > > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need > > > > > to make change to be sure that ZONE_DEVICE page never endup in one of > > > > > the path that try to put them back on lru. There is a lot of place that > > > > > would need to be updated and it would be highly intrusive and add a > > > > > lot of special cases to other hot code path. > > > > > > > > Could you explain more on where the requirement comes from or point me to > > > > where I can read about this. > > > > > > > > > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page) > > > in _any_ vma. So i need to know when a page is freed ie either as result of > > > unmap, exit or migration or anything that would free the memory. For zone > > > device a page is free once its refcount reach 1 so i need to catch refcount > > > transition from 2->1 > > > > What if we would rework zone device to have pages with refcount 0 at > > start? > > That is a _lot_ of work from top of my head because it would need changes > to a lot of places and likely more hot code path that simply adding some- > thing to put_page() note that i only need something in put_page() i do not > need anything in the get page path. Is adding a conditional branch for > HMM pages in put_page() that much of a problem ? Well, it gets inlined everywhere. Removing zone_device code from get_page() and put_page() saved non-trivial ~140k in vmlinux for allyesconfig. Re-introducing part this bloat would be unfortunate. > > > This is the only way i can inform the device that the page is now free. See > > > > > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d > > > > > > There is _no_ way around that. > > > > I'm still not convinced that it's impossible. > > > > Could you describe lifecycle for pages in case of HMM? > > Process malloc something, end it over to some function in the program > that use the GPU that function call GPU API (OpenCL, CUDA, ...) that > trigger a migration to device memory. 
> > So in the kernel you get a migration like any existing migration, > original page is unmap, if refcount is all ok (no pin) then a device > page is allocated and thing are migrated to device memory. > > What happen after is unknown. Either userspace/kernel driver decide > to migrate back to system memory, either there is an munmap, either > there is a CPU page fault, ... So from that point on the device page > as the exact same life as a regular page. > > Above i describe the migrate case, but you can also have new memory > allocation that directly allocate device memory. For instance if the > GPU do a page fault on an address that isn't back by anything then > we can directly allocate a device page. No migration involve in that > case. > > HMM pages are like any other pages in most respect. Exception are: > - no GUP Hm. How do you exclude GUP? And why is it required? -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-05-02 11:37 ` Kirill A. Shutemov @ 2017-05-02 13:22 ` Jerome Glisse 0 siblings, 0 replies; 43+ messages in thread From: Jerome Glisse @ 2017-05-02 13:22 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Dan Williams, Ingo Molnar, linux-kernel@vger.kernel.org, Linux MM, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Tue, May 02, 2017 at 02:37:46PM +0300, Kirill A. Shutemov wrote: > On Mon, May 01, 2017 at 09:55:48AM -0400, Jerome Glisse wrote: > > On Mon, May 01, 2017 at 01:23:59PM +0300, Kirill A. Shutemov wrote: > > > On Sun, Apr 30, 2017 at 07:14:24PM -0400, Jerome Glisse wrote: > > > > On Sat, Apr 29, 2017 at 01:17:26PM +0300, Kirill A. Shutemov wrote: > > > > > On Fri, Apr 28, 2017 at 03:33:07PM -0400, Jerome Glisse wrote: > > > > > > On Fri, Apr 28, 2017 at 12:22:24PM -0700, Dan Williams wrote: > > > > > > > Are you sure about needing to hook the 2 -> 1 transition? Could we > > > > > > > change ZONE_DEVICE pages to not have an elevated reference count when > > > > > > > they are created so you can keep the HMM references out of the mm hot > > > > > > > path? > > > > > > > > > > > > 100% sure on that :) I need to callback into driver for 2->1 transition > > > > > > no way around that. If we change ZONE_DEVICE to not have an elevated > > > > > > reference count that you need to make a lot more change to mm so that > > > > > > ZONE_DEVICE is never use as fallback for memory allocation. Also need > > > > > > to make change to be sure that ZONE_DEVICE page never endup in one of > > > > > > the path that try to put them back on lru. There is a lot of place that > > > > > > would need to be updated and it would be highly intrusive and add a > > > > > > lot of special cases to other hot code path. > > > > > > > > > > Could you explain more on where the requirement comes from or point me to > > > > > where I can read about this. > > > > > > > > > > > > > HMM ZONE_DEVICE pages are use like other pages (anonymous or file back page) > > > > in _any_ vma. So i need to know when a page is freed ie either as result of > > > > unmap, exit or migration or anything that would free the memory. For zone > > > > device a page is free once its refcount reach 1 so i need to catch refcount > > > > transition from 2->1 > > > > > > What if we would rework zone device to have pages with refcount 0 at > > > start? > > > > That is a _lot_ of work from top of my head because it would need changes > > to a lot of places and likely more hot code path that simply adding some- > > thing to put_page() note that i only need something in put_page() i do not > > need anything in the get page path. Is adding a conditional branch for > > HMM pages in put_page() that much of a problem ? > > Well, it gets inlined everywhere. Removing zone_device code from > get_page() and put_page() saved non-trivial ~140k in vmlinux for > allyesconfig. > > Re-introducing part this bloat would be unfortunate. > > > > > This is the only way i can inform the device that the page is now free. See > > > > > > > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-v21&id=52da8fe1a088b87b5321319add79e43b8372ed7d > > > > > > > > There is _no_ way around that. > > > > > > I'm still not convinced that it's impossible. > > > > > > Could you describe lifecycle for pages in case of HMM? > > > > Process malloc something, end it over to some function in the program > > that use the GPU that function call GPU API (OpenCL, CUDA, ...) 
that > > trigger a migration to device memory. > > > > So in the kernel you get a migration like any existing migration, > > original page is unmap, if refcount is all ok (no pin) then a device > > page is allocated and thing are migrated to device memory. > > > > What happen after is unknown. Either userspace/kernel driver decide > > to migrate back to system memory, either there is an munmap, either > > there is a CPU page fault, ... So from that point on the device page > > as the exact same life as a regular page. > > > > Above i describe the migrate case, but you can also have new memory > > allocation that directly allocate device memory. For instance if the > > GPU do a page fault on an address that isn't back by anything then > > we can directly allocate a device page. No migration involve in that > > case. > > > > HMM pages are like any other pages in most respect. Exception are: > > - no GUP > > Hm. How do you exclude GUP? And why is it required? Well it is not forbiden it just can not happen simply because as device memory is not accessible by CPU then the corresponding CPU page table entry is a special entry and thus GUP trigger a page fault that migrate thing back to regular memory. The why is simply because we need to always be able to migrate back to regular memory as device memory is not accessible by the CPU. So we can not allow anyone to pin it. Cheers, Jerome -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
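In other words, the 'no GUP' property does not come from an explicit check in get_user_pages(): device memory is never mapped with a present CPU PTE, so any CPU access -- including the GUP slow path -- faults first, and the fault handler migrates the data back to regular memory. A sketch of that fault-side step, with illustrative names only (is_device_entry() and hmm_migrate_back() are not actual kernel API):

    /*
     * Sketch: the CPU PTE covering device memory is a non-present
     * "special" swap-style entry, so a CPU touch (or GUP falling back
     * to the slow path) ends up here via the fault handler.
     */
    static int handle_device_entry(struct vm_fault *vmf, swp_entry_t entry)
    {
            if (!is_device_entry(entry))            /* hypothetical test */
                    return VM_FAULT_SIGBUS;

            /*
             * Ask the driver to copy the data back into an ordinary page,
             * replace the special entry with a normal PTE, then let the
             * faulting access retry.
             */
            return hmm_migrate_back(vmf->vma, vmf->address, entry);
    }

get_user_pages_fast() never sees a present PTE for such memory either, so it falls back to the slow path, which takes the same fault-and-migrate route before any page can be pinned.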
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-28 17:23 ` [PATCH v2] mm, zone_device: replace " Dan Williams 2017-04-28 17:34 ` Jerome Glisse @ 2017-04-29 14:18 ` Ingo Molnar 2017-05-01 2:45 ` Dan Williams 1 sibling, 1 reply; 43+ messages in thread From: Ingo Molnar @ 2017-04-29 14:18 UTC (permalink / raw) To: Dan Williams Cc: linux-kernel, linux-mm, Jérôme Glisse, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov * Dan Williams <dan.j.williams@intel.com> wrote: > Kirill points out that the calls to {get,put}_dev_pagemap() can be > removed from the mm fast path if we take a single get_dev_pagemap() > reference to signify that the page is alive and use the final put of the > page to drop that reference. > > This does require some care to make sure that any waits for the > percpu_ref to drop to zero occur *after* devm_memremap_page_release(), > since it now maintains its own elevated reference. > > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Jerome Glisse <jglisse@redhat.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Reviewed-by: Logan Gunthorpe <logang@deltatee.com> > Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> > Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> This changelog is lacking an explanation about how this solves the crashes you were seeing. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-29 14:18 ` Ingo Molnar @ 2017-05-01 2:45 ` Dan Williams 2017-05-01 7:12 ` Ingo Molnar 0 siblings, 1 reply; 43+ messages in thread From: Dan Williams @ 2017-05-01 2:45 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, Linux MM, Jérôme Glisse, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Sat, Apr 29, 2017 at 7:18 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Dan Williams <dan.j.williams@intel.com> wrote: > >> Kirill points out that the calls to {get,put}_dev_pagemap() can be >> removed from the mm fast path if we take a single get_dev_pagemap() >> reference to signify that the page is alive and use the final put of the >> page to drop that reference. >> >> This does require some care to make sure that any waits for the >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(), >> since it now maintains its own elevated reference. >> >> Cc: Ingo Molnar <mingo@redhat.com> >> Cc: Jérôme Glisse <jglisse@redhat.com> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> >> Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> >> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com> > > This changelog is lacking an explanation about how this solves the crashes you > were seeing. > Kirill? It wasn't clear to me why the conversion to generic get_user_pages_fast() caused the reference counts to be off. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-05-01 2:45 ` Dan Williams @ 2017-05-01 7:12 ` Ingo Molnar 2017-05-01 9:33 ` Kirill A. Shutemov 0 siblings, 1 reply; 43+ messages in thread From: Ingo Molnar @ 2017-05-01 7:12 UTC (permalink / raw) To: Dan Williams Cc: linux-kernel@vger.kernel.org, Linux MM, Jérôme Glisse, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov * Dan Williams <dan.j.williams@intel.com> wrote: > On Sat, Apr 29, 2017 at 7:18 AM, Ingo Molnar <mingo@kernel.org> wrote: > > > > * Dan Williams <dan.j.williams@intel.com> wrote: > > > >> Kirill points out that the calls to {get,put}_dev_pagemap() can be > >> removed from the mm fast path if we take a single get_dev_pagemap() > >> reference to signify that the page is alive and use the final put of the > >> page to drop that reference. > >> > >> This does require some care to make sure that any waits for the > >> percpu_ref to drop to zero occur *after* devm_memremap_page_release(), > >> since it now maintains its own elevated reference. > >> > >> Cc: Ingo Molnar <mingo@redhat.com> > >> Cc: Jerome Glisse <jglisse@redhat.com> > >> Cc: Andrew Morton <akpm@linux-foundation.org> > >> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> > >> Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> > >> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> > >> Signed-off-by: Dan Williams <dan.j.williams@intel.com> > > > > This changelog is lacking an explanation about how this solves the crashes you > > were seeing. > > Kirill? It wasn't clear to me why the conversion to generic > get_user_pages_fast() caused the reference counts to be off. Ok, the merge window is open and we really need this fix for x86/mm, so this is what I've decoded: The x86 conversion to the generic GUP code included a small change which causes crashes and data corruption in the pmem code - not good. The root cause is that the /dev/pmem driver code implicitly relies on the x86 get_user_pages() implementation doing a get_page() on the page refcount, because get_page() does a get_zone_device_page() which properly refcounts pmem's separate page struct arrays that are not present in the regular page struct structures. (The pmem driver does this because it can cover huge memory areas.) But the x86 conversion to the generic GUP code changed the get_page() to page_cache_get_speculative() which is faster but doesn't do the get_zone_device_page() call the pmem code relies on. One way to solve the regression would be to change the generic GUP code to use get_page(), but that would slow things down a bit and punish other generic-GUP using architectures for an x86-ism they did not care about. (Arguably the pmem driver was probably not working reliably for them: but nvdimm is an Intel feature, so non-x86 exposure is probably still limited.) So restructure the pmem code's interface with the MM instead: get rid of the get/put_zone_device_page() distinction, integrate put_zone_device_page() into __put_page() and and restructure the pmem completion-wait and teardown machinery. This speeds up things while also making the pmem refcounting more robust going forward. ... is this extension to the changelog correct? I'll apply this for the time being - but can still amend the text before sending it to Linus later today. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
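The regression Ingo summarizes comes down to which primitive the fast-GUP inner loop uses to take its page reference. Roughly (paraphrased for illustration, not the literal code of either implementation):

    /* old x86-specific gup_pte_range() */
    get_page(page);         /* get_page() also called get_zone_device_page(),
                               keeping pmem's dev_pagemap pinned */

    /* generic gup_pte_range() after the conversion */
    head = compound_head(page);
    if (!page_cache_get_speculative(head))  /* bare refcount bump, no
                                               ZONE_DEVICE hook */
            goto pte_unmap;

With the fix under discussion, the pagemap reference is instead taken once per page at devm_memremap_pages() time and dropped on the final put of the page, so neither GUP variant needs a ZONE_DEVICE-aware get.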
* Re: [PATCH v2] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-05-01 7:12 ` Ingo Molnar @ 2017-05-01 9:33 ` Kirill A. Shutemov 0 siblings, 0 replies; 43+ messages in thread From: Kirill A. Shutemov @ 2017-05-01 9:33 UTC (permalink / raw) To: Ingo Molnar Cc: Dan Williams, linux-kernel@vger.kernel.org, Linux MM, Jérôme Glisse, Ingo Molnar, Andrew Morton, Logan Gunthorpe, Kirill Shutemov On Mon, May 01, 2017 at 09:12:59AM +0200, Ingo Molnar wrote: > ... is this extension to the changelog correct? Looks good to me. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-27 0:55 ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Dan Williams 2017-04-27 8:33 ` Kirill A. Shutemov @ 2017-04-27 16:11 ` Logan Gunthorpe 2017-04-27 16:14 ` Dan Williams 1 sibling, 1 reply; 43+ messages in thread From: Logan Gunthorpe @ 2017-04-27 16:11 UTC (permalink / raw) To: Dan Williams, akpm Cc: linux-mm, Jérôme Glisse, linux-kernel, Kirill Shutemov On 26/04/17 06:55 PM, Dan Williams wrote: > @@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys) > * > * Notes: > * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time > - * (or devm release event). > + * (or devm release event). The expected order of events is that @ref has > + * been through percpu_ref_kill() before devm_memremap_pages_release(). The > + * wait for the completion of kill and percpu_ref_exit() must occur after > + * devm_memremap_pages_release(). > * > * 2/ @res is expected to be a host memory range that could feasibly be > * treated as a "System RAM" range, i.e. not a device mmio range, but > @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res, > */ > list_del(&page->lru); > page->pgmap = pgmap; > + percpu_ref_get(ref); > } > devres_add(dev, page_map); > return __va(res->start); > diff --git a/mm/swap.c b/mm/swap.c > index 5dabf444d724..01267dda6668 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page) > > void __put_page(struct page *page) > { > + if (is_zone_device_page(page)) { > + put_dev_pagemap(page->pgmap); > + > + /* > + * The page belong to device, do not return it to > + * page allocator. > + */ > + return; > + } > + > if (unlikely(PageCompound(page))) > __put_compound_page(page); > else > Forgive me if I'm missing something but this doesn't make sense to me. We are taking a reference once when the region is initialized and releasing it every time a page within the region's reference count drops to zero. That does not seem to be symmetric and I don't see how it tracks that pages are in use. Shouldn't get_dev_pagemap be called when any page is allocated or something like that (ie. the inverse of __put_page)? Thanks, Logan -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-27 16:11 ` [PATCH] " Logan Gunthorpe @ 2017-04-27 16:14 ` Dan Williams 2017-04-27 16:33 ` Logan Gunthorpe 0 siblings, 1 reply; 43+ messages in thread From: Dan Williams @ 2017-04-27 16:14 UTC (permalink / raw) To: Logan Gunthorpe Cc: Andrew Morton, Linux MM, Jérôme Glisse, linux-kernel@vger.kernel.org, Kirill Shutemov On Thu, Apr 27, 2017 at 9:11 AM, Logan Gunthorpe <logang@deltatee.com> wrote: > > > On 26/04/17 06:55 PM, Dan Williams wrote: >> @@ -277,7 +269,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys) >> * >> * Notes: >> * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time >> - * (or devm release event). >> + * (or devm release event). The expected order of events is that @ref has >> + * been through percpu_ref_kill() before devm_memremap_pages_release(). The >> + * wait for the completion of kill and percpu_ref_exit() must occur after >> + * devm_memremap_pages_release(). >> * >> * 2/ @res is expected to be a host memory range that could feasibly be >> * treated as a "System RAM" range, i.e. not a device mmio range, but >> @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res, >> */ >> list_del(&page->lru); >> page->pgmap = pgmap; >> + percpu_ref_get(ref); >> } >> devres_add(dev, page_map); >> return __va(res->start); >> diff --git a/mm/swap.c b/mm/swap.c >> index 5dabf444d724..01267dda6668 100644 >> --- a/mm/swap.c >> +++ b/mm/swap.c >> @@ -97,6 +97,16 @@ static void __put_compound_page(struct page *page) >> >> void __put_page(struct page *page) >> { >> + if (is_zone_device_page(page)) { >> + put_dev_pagemap(page->pgmap); >> + >> + /* >> + * The page belong to device, do not return it to >> + * page allocator. >> + */ >> + return; >> + } >> + >> if (unlikely(PageCompound(page))) >> __put_compound_page(page); >> else >> > > Forgive me if I'm missing something but this doesn't make sense to me. > We are taking a reference once when the region is initialized and > releasing it every time a page within the region's reference count drops > to zero. That does not seem to be symmetric and I don't see how it > tracks that pages are in use. Shouldn't get_dev_pagemap be called when > any page is allocated or something like that (ie. the inverse of > __put_page)? You're overlooking that the page reference count 1 after arch_add_memory(). So at the end of time we're just dropping the arch_add_memory() reference to release the page and related dev_pagemap. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-27 16:14 ` Dan Williams @ 2017-04-27 16:33 ` Logan Gunthorpe 2017-04-27 16:38 ` Dan Williams 0 siblings, 1 reply; 43+ messages in thread From: Logan Gunthorpe @ 2017-04-27 16:33 UTC (permalink / raw) To: Dan Williams Cc: Andrew Morton, Linux MM, Jérôme Glisse, linux-kernel@vger.kernel.org, Kirill Shutemov On 27/04/17 10:14 AM, Dan Williams wrote: > You're overlooking that the page reference count 1 after > arch_add_memory(). So at the end of time we're just dropping the > arch_add_memory() reference to release the page and related > dev_pagemap. Thanks, that does actually make a lot more sense to me now. However, there still appears to be an asymmetry in that the pgmap->ref is incremented once and decremented once per page... Logan -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-27 16:33 ` Logan Gunthorpe @ 2017-04-27 16:38 ` Dan Williams 2017-04-27 16:45 ` Logan Gunthorpe 0 siblings, 1 reply; 43+ messages in thread From: Dan Williams @ 2017-04-27 16:38 UTC (permalink / raw) To: Logan Gunthorpe Cc: Andrew Morton, Linux MM, Jérôme Glisse, linux-kernel@vger.kernel.org, Kirill Shutemov On Thu, Apr 27, 2017 at 9:33 AM, Logan Gunthorpe <logang@deltatee.com> wrote: > > > On 27/04/17 10:14 AM, Dan Williams wrote: >> You're overlooking that the page reference count 1 after >> arch_add_memory(). So at the end of time we're just dropping the >> arch_add_memory() reference to release the page and related >> dev_pagemap. > > Thanks, that does actually make a lot more sense to me now. However, > there still appears to be an asymmetry in that the pgmap->ref is > incremented once and decremented once per page... > No, this hunk... @@ -379,6 +374,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res, */ list_del(&page->lru); page->pgmap = pgmap; + percpu_ref_get(ref); } ...is inside a for_each_device_pfn() loop. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
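For context, the percpu_ref_get() Logan asked about sits inside the per-pfn setup loop of devm_memremap_pages(), so the accounting is one get per device page at setup time and one put per device page at final-put time. Condensed from the hunks quoted above (not the exact upstream code):

    /* devm_memremap_pages(): one pagemap reference per device page */
    for_each_device_pfn(pfn, page_map) {
            struct page *page = pfn_to_page(pfn);

            list_del(&page->lru);
            page->pgmap = pgmap;
            percpu_ref_get(ref);    /* paired with put_dev_pagemap() below */
    }

    /* __put_page(): the final put of a device page drops that reference */
    if (is_zone_device_page(page)) {
            put_dev_pagemap(page->pgmap);
            /* device pages are never returned to the page allocator */
            return;
    }

Each device page leaves arch_add_memory() with refcount 1, so the eventual __put_page() is the symmetric release of the reference taken in the loop.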
* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-27 16:38 ` Dan Williams @ 2017-04-27 16:45 ` Logan Gunthorpe 2017-04-27 16:46 ` Dan Williams 0 siblings, 1 reply; 43+ messages in thread From: Logan Gunthorpe @ 2017-04-27 16:45 UTC (permalink / raw) To: Dan Williams Cc: Andrew Morton, Linux MM, Jérôme Glisse, linux-kernel@vger.kernel.org, Kirill Shutemov On 27/04/17 10:38 AM, Dan Williams wrote: > ...is inside a for_each_device_pfn() loop. > Ah, oops. Then that makes perfect sense. Thanks. You may have my review tag if you'd like: Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Logan -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference 2017-04-27 16:45 ` Logan Gunthorpe @ 2017-04-27 16:46 ` Dan Williams 0 siblings, 0 replies; 43+ messages in thread From: Dan Williams @ 2017-04-27 16:46 UTC (permalink / raw) To: Logan Gunthorpe Cc: Andrew Morton, Linux MM, Jérôme Glisse, linux-kernel@vger.kernel.org, Kirill Shutemov On Thu, Apr 27, 2017 at 9:45 AM, Logan Gunthorpe <logang@deltatee.com> wrote: > > > On 27/04/17 10:38 AM, Dan Williams wrote: >> ...is inside a for_each_device_pfn() loop. >> > > Ah, oops. Then that makes perfect sense. Thanks. > > You may have my review tag if you'd like: > > Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Thanks! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
Thread overview: 43+ messages -- the index below shows the thread structure --
     [not found] <CAA9_cmf7=aGXKoQFkzS_UJtznfRtWofitDpV2AyGwpaRGKyQkg@mail.gmail.com>
2017-04-23 23:31 ` get_zone_device_page() in get_page() and page_cache_get_speculative() Kirill A. Shutemov
2017-04-24 17:23   ` Dan Williams
2017-04-24 17:30     ` Kirill A. Shutemov
2017-04-24 17:47       ` Dan Williams
2017-04-24 18:01         ` Kirill A. Shutemov
2017-04-24 18:25           ` Kirill A. Shutemov
2017-04-24 18:41             ` Dan Williams
2017-04-25 13:19               ` Kirill A. Shutemov
2017-04-25 16:44                 ` Dan Williams
2017-04-27  0:55   ` [PATCH] mm, zone_device: replace {get, put}_zone_device_page() with a single reference Dan Williams
2017-04-27  8:33     ` Kirill A. Shutemov
2017-04-28  6:39       ` Ingo Molnar
2017-04-28  8:14         ` [PATCH] mm, zone_device: Replace " Kirill A. Shutemov
2017-04-28 17:23         ` [PATCH v2] mm, zone_device: replace " Dan Williams
2017-04-28 17:34           ` Jerome Glisse
2017-04-28 17:41             ` Dan Williams
2017-04-28 18:00               ` Jerome Glisse
2017-04-28 19:02                 ` Dan Williams
2017-04-28 19:16                   ` Jerome Glisse
2017-04-28 19:22                     ` Dan Williams
2017-04-28 19:33                       ` Jerome Glisse
2017-04-29 10:17                         ` Kirill A. Shutemov
2017-04-30 23:14                           ` Jerome Glisse
2017-05-01  1:42                             ` Dan Williams
2017-05-01  1:54                               ` Jerome Glisse
2017-05-01  2:40                                 ` Dan Williams
2017-05-01  3:48                             ` Logan Gunthorpe
2017-05-01 10:23                             ` Kirill A. Shutemov
2017-05-01 13:55                               ` Jerome Glisse
2017-05-01 20:19                                 ` Dan Williams
2017-05-01 20:32                                   ` Jerome Glisse
2017-05-02 11:37                                 ` Kirill A. Shutemov
2017-05-02 13:22                                   ` Jerome Glisse
2017-04-29 14:18           ` Ingo Molnar
2017-05-01  2:45             ` Dan Williams
2017-05-01  7:12               ` Ingo Molnar
2017-05-01  9:33                 ` Kirill A. Shutemov
2017-04-27 16:11     ` [PATCH] " Logan Gunthorpe
2017-04-27 16:14       ` Dan Williams
2017-04-27 16:33         ` Logan Gunthorpe
2017-04-27 16:38           ` Dan Williams
2017-04-27 16:45             ` Logan Gunthorpe
2017-04-27 16:46               ` Dan Williams