From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Date: Mon, 4 Feb 2013 18:04:06 +0800 Message-ID: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Return-path: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Currently get_user_pages() always tries to allocate pages from movable zone, as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users of get_user_pages() is easy to pin user pages for a long time(for now we found that pages pinned as aio ring pages is such case), which is fatal for memory hotplug/remove framework. So the 1st patch introduces a new library function called get_user_pages_non_movable() to pin pages only from zone non-movable in memory. It's a wrapper of get_user_pages() but it makes sure that all pages come from non-movable zone via additional page migration. The 2nd patch gets around the aio ring pages can't be migrated bug caused by get_user_pages() via using the new function. It only works when configed with CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). Lin Feng (2): mm: hotplug: implement non-movable version of get_user_pages() fs/aio.c: use non-movable version of get_user_pages() to pin ring pages when support memory hotremove fs/aio.c | 6 +++++ include/linux/mm.h | 5 ++++ include/linux/mmzone.h | 4 ++++ mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++ mm/page_isolation.c | 5 ++++ 5 files changed, 83 insertions(+) -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 4 Feb 2013 18:04:07 +0800 Message-ID: <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Return-path: In-Reply-To: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org get_user_pages() always tries to allocate pages from movable zone, which is not reliable to memory hotremove framework in some case. This patch introduces a new library function called get_user_pages_non_movable() to pin pages only from zone non-movable in memory. It's a wrapper of get_user_pages() but it makes sure that all pages come from non-movable zone via additional page migration. Cc: Andrew Morton Cc: Mel Gorman Cc: KAMEZAWA Hiroyuki Cc: Yasuaki Ishimatsu Reviewed-by: Tang Chen Reviewed-by: Gu Zheng Signed-off-by: Lin Feng --- include/linux/mm.h | 5 ++++ include/linux/mmzone.h | 4 ++++ mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++ mm/page_isolation.c | 5 ++++ 4 files changed, 77 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 66e2f7c..2a25d0e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, struct page **pages, struct vm_area_struct **vmas); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); +#ifdef CONFIG_MEMORY_HOTREMOVE +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, int force, + struct page **pages, struct vm_area_struct **vmas); +#endif struct kvec; int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, struct page **pages); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 73b64a3..5db811e 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) return (idx == ZONE_NORMAL); } +static inline int is_movable(struct zone *zone) +{ + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; +} /** * is_highmem - helper function to quickly check if a struct zone is a * highmem zone or not. This is an attempt to keep references diff --git a/mm/memory.c b/mm/memory.c index bb1369f..e3b8e19 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -58,6 +58,8 @@ #include #include #include +#include +#include #include #include @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, } EXPORT_SYMBOL(get_user_pages); +#ifdef CONFIG_MEMORY_HOTREMOVE +/** + * It's a wrapper of get_user_pages() but it makes sure that all pages come from + * non-movable zone via additional page migration. + */ +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, int force, + struct page **pages, struct vm_area_struct **vmas) +{ + int ret, i, isolate_err, migrate_pre_flag; + LIST_HEAD(pagelist); + +retry: + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, + vmas); + + isolate_err = 0; + migrate_pre_flag = 0; + + for (i = 0; i < ret; i++) { + if (is_movable(page_zone(pages[i]))) { + if (!migrate_pre_flag) { + if (migrate_prep()) + goto put_page; + migrate_pre_flag = 1; + } + + if (!isolate_lru_page(pages[i])) { + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + + page_is_file_cache(pages[i])); + list_add_tail(&pages[i]->lru, &pagelist); + } else { + isolate_err = 1; + goto put_page; + } + } + } + + /* All pages are non movable, we are done :) */ + if (i == ret && list_empty(&pagelist)) + return ret; + +put_page: + /* Undo the effects of former get_user_pages(), we won't pin anything */ + for (i = 0; i < ret; i++) + put_page(pages[i]); + + if (migrate_pre_flag && !isolate_err) { + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, + false, MIGRATE_SYNC, MR_SYSCALL); + /* Steal pages from non-movable zone successfully? */ + if (!ret) + goto retry; + } + + putback_lru_pages(&pagelist); + return 0; +} +EXPORT_SYMBOL(get_user_pages_non_movable); +#endif + /** * get_dump_page() - pin user page in memory while writing it to core dump * @addr: user address diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 383bdbb..1b7bd17 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -247,6 +247,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, return ret ? 0 : -EBUSY; } +/** + * @private: 0 means page can be alloced from movable zone, otherwise forbidden + */ struct page *alloc_migrate_target(struct page *page, unsigned long private, int **resultp) { @@ -254,6 +257,8 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private, if (PageHighMem(page)) gfp_mask |= __GFP_HIGHMEM; + if (unlikely(private != 0)) + gfp_mask &= ~__GFP_MOVABLE; return alloc_page(gfp_mask); } -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Date: Mon, 4 Feb 2013 18:04:08 +0800 Message-ID: <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Return-path: In-Reply-To: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org This patch gets around the aio ring pages can't be migrated bug caused by get_user_pages() via using the new function. It only works as configed with CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). Cc: Benjamin LaHaise Cc: Alexander Viro Cc: Andrew Morton Reviewed-by: Tang Chen Reviewed-by: Gu Zheng Signed-off-by: Lin Feng --- fs/aio.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/aio.c b/fs/aio.c index 71f613c..0e9b30a 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) } dprintk("mmap address: 0x%08lx\n", info->mmap_base); +#ifdef CONFIG_MEMORY_HOTREMOVE + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, + info->mmap_base, nr_pages, + 1, 0, info->ring_pages, NULL); +#else info->nr_pages = get_user_pages(current, ctx->mm, info->mmap_base, nr_pages, 1, 0, info->ring_pages, NULL); +#endif up_write(&ctx->mm->mmap_sem); if (unlikely(info->nr_pages != nr_pages)) { -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff Moyer Subject: Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Date: Mon, 04 Feb 2013 10:18:03 -0500 Message-ID: References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Lin Feng Return-path: In-Reply-To: <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> (Lin Feng's message of "Mon, 4 Feb 2013 18:04:08 +0800") Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Lin Feng writes: > This patch gets around the aio ring pages can't be migrated bug caused by > get_user_pages() via using the new function. It only works as configed with > CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > > Cc: Benjamin LaHaise > Cc: Alexander Viro > Cc: Andrew Morton > Reviewed-by: Tang Chen > Reviewed-by: Gu Zheng > Signed-off-by: Lin Feng > --- > fs/aio.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/fs/aio.c b/fs/aio.c > index 71f613c..0e9b30a 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) > } > > dprintk("mmap address: 0x%08lx\n", info->mmap_base); > +#ifdef CONFIG_MEMORY_HOTREMOVE > + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, > + info->mmap_base, nr_pages, > + 1, 0, info->ring_pages, NULL); > +#else > info->nr_pages = get_user_pages(current, ctx->mm, > info->mmap_base, nr_pages, > 1, 0, info->ring_pages, NULL); > +#endif Can't you hide this in your 1/1 patch, by providing this function as just a static inline wrapper around get_user_pages when CONFIG_MEMORY_HOTREMOVE is not enabled? Cheers, Jeff -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Zach Brown Subject: Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Date: Mon, 4 Feb 2013 15:02:09 -0800 Message-ID: <20130204230209.GK14246@lenny.home.zabbo.net> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Lin Feng , akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Jeff Moyer Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org > > index 71f613c..0e9b30a 100644 > > --- a/fs/aio.c > > +++ b/fs/aio.c > > @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) > > } > > > > dprintk("mmap address: 0x%08lx\n", info->mmap_base); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, > > + info->mmap_base, nr_pages, > > + 1, 0, info->ring_pages, NULL); > > +#else > > info->nr_pages = get_user_pages(current, ctx->mm, > > info->mmap_base, nr_pages, > > 1, 0, info->ring_pages, NULL); > > +#endif > > Can't you hide this in your 1/1 patch, by providing this function as > just a static inline wrapper around get_user_pages when > CONFIG_MEMORY_HOTREMOVE is not enabled? Yes, please. Having callers duplicate the call site for a single optional boolean input is unacceptable. But do we want another input argument as a name? Should aio have been using get_user_pages_fast()? (and so now _fast_non_movable?) I wonder if it's time to offer the booleans as a _flags() variant, much like the current internal flags for __get_user_pages(). The write and force arguments are already booleans, we have a different fast api, and now we're adding non-movable. The NON_MOVABLE flag would be 0 without MEMORY_HOTREMOVE, easy peasy. Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might also be nice :). No? - z -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrew Morton Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 4 Feb 2013 16:06:24 -0800 Message-ID: <20130204160624.5c20a8a0.akpm@linux-foundation.org> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Lin Feng Return-path: In-Reply-To: <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org melreadthis On Mon, 4 Feb 2013 18:04:07 +0800 Lin Feng wrote: > get_user_pages() always tries to allocate pages from movable zone, which is not > reliable to memory hotremove framework in some case. > > This patch introduces a new library function called get_user_pages_non_movable() > to pin pages only from zone non-movable in memory. > It's a wrapper of get_user_pages() but it makes sure that all pages come from > non-movable zone via additional page migration. > > ... > > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > struct page **pages, struct vm_area_struct **vmas); > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > struct page **pages); > +#ifdef CONFIG_MEMORY_HOTREMOVE > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, int nr_pages, int write, int force, > + struct page **pages, struct vm_area_struct **vmas); > +#endif The ifdefs aren't really needed here and I encourage people to omit them. This keeps the header files looking neater and reduces the chances of things later breaking because we forgot to update some CONFIG_foo logic in a header file. The downside is that errors will be revealed at link time rather than at compile time, but that's a pretty small cost. > struct kvec; > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > struct page **pages); > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 73b64a3..5db811e 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > return (idx == ZONE_NORMAL); > } > > +static inline int is_movable(struct zone *zone) > +{ > + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > +} A better name would be zone_is_movable(). We haven't been very consistent about this in mmzone.h, but zone_is_foo() is pretty common. And a neater implementation would be return zone_idx(zone) == ZONE_MOVABLE; All of which made me look at mmzone.h, and what I saw wasn't very nice :( Please review... From: Andrew Morton Subject: include/linux/mmzone.h: cleanups - implement zone_idx() in C to fix its references-args-twice macro bug - use zone_idx() in is_highmem() to remove large amounts of silly fluff. Cc: Lin Feng Cc: Mel Gorman Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups +++ a/include/linux/mmzone.h @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by /* * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. */ -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) +static inline enum zone_type zone_idx(struct zone *zone) +{ + return zone - zone->zone_pgdat->node_zones; +} static inline int populated_zone(struct zone *zone) { @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon static inline int is_highmem(struct zone *zone) { #ifdef CONFIG_HIGHMEM - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || - (zone_off == ZONE_MOVABLE * sizeof(*zone) && - zone_movable_is_highmem()); + enum zone_type idx = zone_idx(zone); + + return idx == ZONE_HIGHMEM || + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); #else return 0; #endif _ > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -58,6 +58,8 @@ > #include > #include > #include > +#include > +#include > #include > > #include > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > } > EXPORT_SYMBOL(get_user_pages); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > +/** > + * It's a wrapper of get_user_pages() but it makes sure that all pages come from > + * non-movable zone via additional page migration. > + */ This needs a description of why the function exists - say something about the requirements of memory hotplug. Also a few words describing how the function works would be good. > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, int nr_pages, int write, int force, > + struct page **pages, struct vm_area_struct **vmas) > +{ > + int ret, i, isolate_err, migrate_pre_flag; > + LIST_HEAD(pagelist); > + > +retry: > + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, > + vmas); We should handle (ret < 0) here. At present the function will silently convert an error return into "return 0", which is bad. The function does appear to otherwise do the right thing if get_user_pages() failed, but only due to good luck. > + isolate_err = 0; > + migrate_pre_flag = 0; > + > + for (i = 0; i < ret; i++) { > + if (is_movable(page_zone(pages[i]))) { > + if (!migrate_pre_flag) { > + if (migrate_prep()) > + goto put_page; > + migrate_pre_flag = 1; > + } > + > + if (!isolate_lru_page(pages[i])) { > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > + page_is_file_cache(pages[i])); > + list_add_tail(&pages[i]->lru, &pagelist); > + } else { > + isolate_err = 1; > + goto put_page; > + } > + } > + } > + > + /* All pages are non movable, we are done :) */ > + if (i == ret && list_empty(&pagelist)) > + return ret; > + > +put_page: > + /* Undo the effects of former get_user_pages(), we won't pin anything */ > + for (i = 0; i < ret; i++) > + put_page(pages[i]); > + > + if (migrate_pre_flag && !isolate_err) { > + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > + false, MIGRATE_SYNC, MR_SYSCALL); > + /* Steal pages from non-movable zone successfully? */ > + if (!ret) > + goto retry; This is buggy. migrate_pages() doesn't empty its `from' argument, so page_list must be reinitialised here (or, better, at the start of the loop). Mel, what's up with migrate_pages()? Shouldn't it be removing the pages from the list when MIGRATEPAGE_SUCCESS? The use of list_for_each_entry_safe() suggests we used to do that... > + } > + > + putback_lru_pages(&pagelist); > + return 0; > +} > +EXPORT_SYMBOL(get_user_pages_non_movable); > +#endif > + Generally, I'd like Mel to go through (the next version of) this patch carefully, please. -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrew Morton Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 4 Feb 2013 16:18:48 -0800 Message-ID: <20130204161848.bfd36176.akpm@linux-foundation.org> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit To: Lin Feng , mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Return-path: In-Reply-To: <20130204160624.5c20a8a0.akpm@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Also... On Mon, 4 Feb 2013 16:06:24 -0800 Andrew Morton wrote: > > +put_page: > > + /* Undo the effects of former get_user_pages(), we won't pin anything */ > > + for (i = 0; i < ret; i++) > > + put_page(pages[i]); We can use release_pages() here. release_pages() is designed to be more efficient when we're putting the final reference to (most of) the pages. It probably has little if any benefit when putting still-in-use pages, as we're doing here. But please consider... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Minchan Kim Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Date: Tue, 5 Feb 2013 09:58:59 +0900 Message-ID: <20130205005859.GE2610@blaptop> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Lin Feng Return-path: Content-Disposition: inline In-Reply-To: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hello, On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > Currently get_user_pages() always tries to allocate pages from movable zone, > as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > of get_user_pages() is easy to pin user pages for a long time(for now we found > that pages pinned as aio ring pages is such case), which is fatal for memory > hotplug/remove framework. > > So the 1st patch introduces a new library function called > get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > It's a wrapper of get_user_pages() but it makes sure that all pages come from > non-movable zone via additional page migration. > > The 2nd patch gets around the aio ring pages can't be migrated bug caused by > get_user_pages() via using the new function. It only works when configed with > CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). CMA has same issue but the problem is the driver developers or any subsystem using GUP can't know their pages is in CMA area or not in advance. So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? Even some driver module in embedded side doesn't open their source code. I would like to make GUP smart so it allocates a page from non-movable/non-cma area when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :( > > Lin Feng (2): > mm: hotplug: implement non-movable version of get_user_pages() > fs/aio.c: use non-movable version of get_user_pages() to pin ring > pages when support memory hotremove > > fs/aio.c | 6 +++++ > include/linux/mm.h | 5 ++++ > include/linux/mmzone.h | 4 ++++ > mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++ > mm/page_isolation.c | 5 ++++ > 5 files changed, 83 insertions(+) > > -- > 1.7.11.7 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 05 Feb 2013 11:09:27 +0800 Message-ID: <511077E7.4040605@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Tang chen , Gu Zheng To: Andrew Morton Return-path: In-Reply-To: <20130204160624.5c20a8a0.akpm@linux-foundation.org> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Andrew, On 02/05/2013 08:06 AM, Andrew Morton wrote: > > melreadthis > > On Mon, 4 Feb 2013 18:04:07 +0800 > Lin Feng wrote: > >> get_user_pages() always tries to allocate pages from movable zone, which is not >> reliable to memory hotremove framework in some case. >> >> This patch introduces a new library function called get_user_pages_non_movable() >> to pin pages only from zone non-movable in memory. >> It's a wrapper of get_user_pages() but it makes sure that all pages come from >> non-movable zone via additional page migration. >> >> ... >> >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, >> struct page **pages, struct vm_area_struct **vmas); >> int get_user_pages_fast(unsigned long start, int nr_pages, int write, >> struct page **pages); >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, >> + unsigned long start, int nr_pages, int write, int force, >> + struct page **pages, struct vm_area_struct **vmas); >> +#endif > > The ifdefs aren't really needed here and I encourage people to omit > them. This keeps the header files looking neater and reduces the > chances of things later breaking because we forgot to update some > CONFIG_foo logic in a header file. The downside is that errors will be > revealed at link time rather than at compile time, but that's a pretty > small cost. OK, got it, thanks :) > >> struct kvec; >> int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, >> struct page **pages); >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >> index 73b64a3..5db811e 100644 >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) >> return (idx == ZONE_NORMAL); >> } >> >> +static inline int is_movable(struct zone *zone) >> +{ >> + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; >> +} > > A better name would be zone_is_movable(). We haven't been very > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > Yes, zone_is_xxx() should be a better name, but there are some analogous definition like is_dma32() and is_normal() etc, if we only use zone_is_movable() for movable zone it will break such naming rules. Should we update other ones in a separate patch later or just keep the old style? > And a neater implementation would be > > return zone_idx(zone) == ZONE_MOVABLE; > OK. After your change, should we also update for other ones such as is_normal()..? > All of which made me look at mmzone.h, and what I saw wasn't very nice :( > > Please review... > > > From: Andrew Morton > Subject: include/linux/mmzone.h: cleanups > > - implement zone_idx() in C to fix its references-args-twice macro bug > > - use zone_idx() in is_highmem() to remove large amounts of silly fluff. > > Cc: Lin Feng > Cc: Mel Gorman > Signed-off-by: Andrew Morton > --- > > include/linux/mmzone.h | 13 ++++++++----- > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h > --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups > +++ a/include/linux/mmzone.h > @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by > /* > * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. > */ > -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > +static inline enum zone_type zone_idx(struct zone *zone) > +{ > + return zone - zone->zone_pgdat->node_zones; > +} > > static inline int populated_zone(struct zone *zone) > { > @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon > static inline int is_highmem(struct zone *zone) > { > #ifdef CONFIG_HIGHMEM > - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || > - (zone_off == ZONE_MOVABLE * sizeof(*zone) && > - zone_movable_is_highmem()); > + enum zone_type idx = zone_idx(zone); > + > + return idx == ZONE_HIGHMEM || > + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); > #else > return 0; > #endif > _ > > >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -58,6 +58,8 @@ >> #include >> #include >> #include >> +#include >> +#include >> #include >> >> #include >> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, >> } >> EXPORT_SYMBOL(get_user_pages); >> >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> +/** >> + * It's a wrapper of get_user_pages() but it makes sure that all pages come from >> + * non-movable zone via additional page migration. >> + */ > > This needs a description of why the function exists - say something > about the requirements of memory hotplug. > > Also a few words describing how the function works would be good. OK, I will add them in next version. > >> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, >> + unsigned long start, int nr_pages, int write, int force, >> + struct page **pages, struct vm_area_struct **vmas) >> +{ >> + int ret, i, isolate_err, migrate_pre_flag; >> + LIST_HEAD(pagelist); >> + >> +retry: >> + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, >> + vmas); > > We should handle (ret < 0) here. At present the function will silently > convert an error return into "return 0", which is bad. The function > does appear to otherwise do the right thing if get_user_pages() failed, > but only due to good luck. Sorry, I forgot this, thanks. > >> + isolate_err = 0; >> + migrate_pre_flag = 0; >> + >> + for (i = 0; i < ret; i++) { >> + if (is_movable(page_zone(pages[i]))) { >> + if (!migrate_pre_flag) { >> + if (migrate_prep()) >> + goto put_page; >> + migrate_pre_flag = 1; >> + } >> + >> + if (!isolate_lru_page(pages[i])) { >> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >> + page_is_file_cache(pages[i])); >> + list_add_tail(&pages[i]->lru, &pagelist); >> + } else { >> + isolate_err = 1; >> + goto put_page; >> + } >> + } >> + } >> + >> + /* All pages are non movable, we are done :) */ >> + if (i == ret && list_empty(&pagelist)) >> + return ret; >> + >> +put_page: >> + /* Undo the effects of former get_user_pages(), we won't pin anything */ >> + for (i = 0; i < ret; i++) >> + put_page(pages[i]); >> + >> + if (migrate_pre_flag && !isolate_err) { >> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >> + false, MIGRATE_SYNC, MR_SYSCALL); >> + /* Steal pages from non-movable zone successfully? */ >> + if (!ret) >> + goto retry; > > This is buggy. migrate_pages() doesn't empty its `from' argument, so > page_list must be reinitialised here (or, better, at the start of the loop). migrate_pages() list_for_each_entry_safe() unmap_and_move() if (rc != -EAGAIN) { list_del(&page->lru); } IUUC, the page migrated successfully will be remove from the `from` list here. So if all pages having been migrated successlly, the list will be empty, and we goto retry. Otherwise the rest of the immigrated pages will be handled later by putback_lru_pages(&pagelist), and the function just return. Also there are many places using such logic: 1274 nr_failed = migrate_pages(&pagelist, new_vma_page, 1275 (unsigned long)vma, 1276 false, MIGRATE_SYNC, 1277 MR_MEMPOLICY_MBIND); 1278 if (nr_failed) 1279 putback_lru_pages(&pagelist); Since we traverse the list and handle each page separately, It's likely that we have migrated some page successfully before we encounter a failure. If the pagelist is still carrying the successfully migrated pages when migrate_pages() return, such code is bugyy. > > Mel, what's up with migrate_pages()? Shouldn't it be removing the > pages from the list when MIGRATEPAGE_SUCCESS? The use of > list_for_each_entry_safe() suggests we used to do that... > >> + } >> + >> + putback_lru_pages(&pagelist); >> + return 0; >> +} >> +EXPORT_SYMBOL(get_user_pages_non_movable); >> +#endif >> + > > Generally, I'd like Mel to go through (the next version of) this patch > carefully, please. > > On 02/05/2013 08:18 AM, Andrew Morton wrote:> On Mon, 4 Feb 2013 16:06:24 -0800 > Andrew Morton wrote: > >>> > > +put_page: >>> > > + /* Undo the effects of former get_user_pages(), we won't pin anything */ >>> > > + for (i = 0; i < ret; i++) >>> > > + put_page(pages[i]); > We can use release_pages() here. > > release_pages() is designed to be more efficient when we're putting the > final reference to (most of) the pages. It probably has little if any > benefit when putting still-in-use pages, as we're doing here. > > But please consider... OK, I will try :) > thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Date: Tue, 05 Feb 2013 12:42:48 +0800 Message-ID: <51108DC8.4090704@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Minchan Kim Return-path: In-Reply-To: <20130205005859.GE2610@blaptop> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Minchan, On 02/05/2013 08:58 AM, Minchan Kim wrote: > Hello, > > On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: >> Currently get_user_pages() always tries to allocate pages from movable zone, >> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users >> of get_user_pages() is easy to pin user pages for a long time(for now we found >> that pages pinned as aio ring pages is such case), which is fatal for memory >> hotplug/remove framework. >> >> So the 1st patch introduces a new library function called >> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. >> It's a wrapper of get_user_pages() but it makes sure that all pages come from >> non-movable zone via additional page migration. >> >> The 2nd patch gets around the aio ring pages can't be migrated bug caused by >> get_user_pages() via using the new function. It only works when configed with >> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > > CMA has same issue but the problem is the driver developers or any subsystem > using GUP can't know their pages is in CMA area or not in advance. > So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? > Even some driver module in embedded side doesn't open their source code. Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users of GUP, they will release the pinned pages immediately and to such users they should get a good performance, using the old style interface is a smart way. And we had better just deal with the cases we have to by using the new interface. > > I would like to make GUP smart so it allocates a page from non-movable/non-cma area > when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP > performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :( As I debuged the get_user_pages(), I found that some pages is already there and may be allocated before we call get_user_pages(). __get_user_pages() have following logic to handle such case. 1786 while (!(page = follow_page(vma, start, foll_flags))) { 1787 int ret; To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP as smart as we want :( , so I worked out the migration approach to get around and avoid messing up the current code :) thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Date: Tue, 05 Feb 2013 13:06:26 +0800 Message-ID: <51109352.9070401@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Tang chen , Gu Zheng To: Jeff Moyer Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Jeff, On 02/04/2013 11:18 PM, Jeff Moyer wrote: >> --- >> fs/aio.c | 6 ++++++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/fs/aio.c b/fs/aio.c >> index 71f613c..0e9b30a 100644 >> --- a/fs/aio.c >> +++ b/fs/aio.c >> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) >> } >> >> dprintk("mmap address: 0x%08lx\n", info->mmap_base); >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, >> + info->mmap_base, nr_pages, >> + 1, 0, info->ring_pages, NULL); >> +#else >> info->nr_pages = get_user_pages(current, ctx->mm, >> info->mmap_base, nr_pages, >> 1, 0, info->ring_pages, NULL); >> +#endif > > Can't you hide this in your 1/1 patch, by providing this function as > just a static inline wrapper around get_user_pages when > CONFIG_MEMORY_HOTREMOVE is not enabled? Good idea, it makes the callers more neatly :) thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Minchan Kim Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Date: Tue, 5 Feb 2013 14:25:17 +0900 Message-ID: <20130205052517.GH2610@blaptop> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> <51108DC8.4090704@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Lin Feng Return-path: Content-Disposition: inline In-Reply-To: <51108DC8.4090704@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Lin, On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote: > Hi Minchan, > > On 02/05/2013 08:58 AM, Minchan Kim wrote: > > Hello, > > > > On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > >> Currently get_user_pages() always tries to allocate pages from movable zone, > >> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > >> of get_user_pages() is easy to pin user pages for a long time(for now we found > >> that pages pinned as aio ring pages is such case), which is fatal for memory > >> hotplug/remove framework. > >> > >> So the 1st patch introduces a new library function called > >> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > >> It's a wrapper of get_user_pages() but it makes sure that all pages come from > >> non-movable zone via additional page migration. > >> > >> The 2nd patch gets around the aio ring pages can't be migrated bug caused by > >> get_user_pages() via using the new function. It only works when configed with > >> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > > > > CMA has same issue but the problem is the driver developers or any subsystem > > using GUP can't know their pages is in CMA area or not in advance. > > So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? > > Even some driver module in embedded side doesn't open their source code. > Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users > of GUP, they will release the pinned pages immediately and to such users they should get > a good performance, using the old style interface is a smart way. And we had better just > deal with the cases we have to by using the new interface. Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know in advance they will use pinned pages long time or release in short time because it depends on some event like user's response which is very not predetermined. So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of failing migration by page count and then, investigate they are really using GUP and it's REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? Yes. it could be done but it would be rather trobulesome job. > > > > > I would like to make GUP smart so it allocates a page from non-movable/non-cma area > > when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP > > performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :( > As I debuged the get_user_pages(), I found that some pages is already there and may be > allocated before we call get_user_pages(). __get_user_pages() have following logic to > handle such case. > 1786 while (!(page = follow_page(vma, start, foll_flags))) { > 1787 int ret; > To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP > as smart as we want :( , so I worked out the migration approach to get around and > avoid messing up the current code :) I didn't look at your code in detail yet but What's wrong following this? Change curret GUP with old_GUP. #ifdef CONFIG_MIGRATE_ISOLATE int get_user_pages_non_movable() { .. old_get_user_pages() .. } int get_user_pages() { return get_user_pages_non_movable(); } #else int get_user_pages() { return old_get_user_pages() } #endif > > thanks, > linfeng > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Date: Tue, 05 Feb 2013 13:35:52 +0800 Message-ID: <51109A38.7050605@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> <20130204230209.GK14246@lenny.home.zabbo.net> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Jeff Moyer , akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Tang chen , Gu Zheng To: Zach Brown Return-path: In-Reply-To: <20130204230209.GK14246@lenny.home.zabbo.net> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Zach, On 02/05/2013 07:02 AM, Zach Brown wrote: >>> index 71f613c..0e9b30a 100644 >>> --- a/fs/aio.c >>> +++ b/fs/aio.c >>> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) >>> } >>> >>> dprintk("mmap address: 0x%08lx\n", info->mmap_base); >>> +#ifdef CONFIG_MEMORY_HOTREMOVE >>> + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, >>> + info->mmap_base, nr_pages, >>> + 1, 0, info->ring_pages, NULL); >>> +#else >>> info->nr_pages = get_user_pages(current, ctx->mm, >>> info->mmap_base, nr_pages, >>> 1, 0, info->ring_pages, NULL); >>> +#endif >> >> Can't you hide this in your 1/1 patch, by providing this function as >> just a static inline wrapper around get_user_pages when >> CONFIG_MEMORY_HOTREMOVE is not enabled? > > Yes, please. Having callers duplicate the call site for a single > optional boolean input is unacceptable. I will deal with it in next version :) > > But do we want another input argument as a name? Should aio have been > using get_user_pages_fast()? (and so now _fast_non_movable?) > > I wonder if it's time to offer the booleans as a _flags() variant, much > like the current internal flags for __get_user_pages(). The write and > force arguments are already booleans, we have a different fast api, and > now we're adding non-movable. The NON_MOVABLE flag would be 0 without > MEMORY_HOTREMOVE, easy peasy. As my next reply-mail mentioned, IIUC in GUP case additional flags seems doesn't work, I abstract here: As I debuged the get_user_pages(), I found that some pages is already there and may be allocated before we call get_user_pages(). __get_user_pages() have following logic to handle such case. 1786 while (!(page = follow_page(vma, start, foll_flags))) { 1787 int ret; To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP as smart as we want , so I worked out the migration approach to get around and avoid messing up the current code. And even worse we have already got *8* arguments...Maybe we have to rework the boolean arguments into bit flags... It seems not a little work :( > > Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might > also be nice :). Agree, maybe we could handle them later :) thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Date: Tue, 05 Feb 2013 14:18:42 +0800 Message-ID: <5110A442.5000707@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> <51108DC8.4090704@cn.fujitsu.com> <20130205052517.GH2610@blaptop> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Minchan Kim Return-path: In-Reply-To: <20130205052517.GH2610@blaptop> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 02/05/2013 01:25 PM, Minchan Kim wrote: > Hi Lin, > > On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote: >> Hi Minchan, >> >> On 02/05/2013 08:58 AM, Minchan Kim wrote: >>> Hello, >>> >>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: >>>> Currently get_user_pages() always tries to allocate pages from movable zone, >>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users >>>> of get_user_pages() is easy to pin user pages for a long time(for now we found >>>> that pages pinned as aio ring pages is such case), which is fatal for memory >>>> hotplug/remove framework. >>>> >>>> So the 1st patch introduces a new library function called >>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. >>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from >>>> non-movable zone via additional page migration. >>>> >>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by >>>> get_user_pages() via using the new function. It only works when configed with >>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). >>> >>> CMA has same issue but the problem is the driver developers or any subsystem >>> using GUP can't know their pages is in CMA area or not in advance. >>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? >>> Even some driver module in embedded side doesn't open their source code. >> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users >> of GUP, they will release the pinned pages immediately and to such users they should get >> a good performance, using the old style interface is a smart way. And we had better just >> deal with the cases we have to by using the new interface. > > Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages > immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power > by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know > in advance they will use pinned pages long time or release in short time because it depends > on some event like user's response which is very not predetermined. > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of > failing migration by page count and then, investigate they are really using GUP and it's > REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? > > Yes. it could be done but it would be rather trobulesome job. Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is a big job. If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the *non_movable* version of GUP. But in some server environment we want to keep the performance but also want to use hotremove feature in case. Maybe patch the place as we need is a trade off for such support. Mel also said in the last discussion: On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote: >> Well, that's a fairly low-level implementation detail. A more typical >> approach would be to add a new get_user_pages_non_movable() or such. >> That would probably have the same signature as get_user_pages(), with >> one additional argument. Then get_user_pages() becomes a one-line >> wrapper which passes in a particular value of that argument. >> > > That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE > allocations. That will impact THP availability by increasing the number > of MIGRATE_UNMOVABLE blocks that exist and it would hit every user -- > not just those that care about ZONE_MOVABLE. > > I'm likely to NAK such a patch if it's only about node hot-remove because > it's much more of a corner case than wanting to use THP. > > I would prefer if get_user_pages() checked if the page it was about to > pin was in ZONE_MOVABLE and if so, migrate it at that point before it's > pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability > if that's what they want. The CMA people might also want to take > advantage of this if the page happened to be in the MIGRATE_CMA > pageblock. > So it may not a good idea that we all fall into calling the *non_movable* version of GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think? > #ifdef CONFIG_MIGRATE_ISOLATE > > int get_user_pages_non_movable() > { > .. > old_get_user_pages() > .. > } > > int get_user_pages() > { > return get_user_pages_non_movable(); > } > #else > int get_user_pages() > { > return old_get_user_pages() > } > #endif thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Minchan Kim Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Date: Tue, 5 Feb 2013 16:45:19 +0900 Message-ID: <20130205074519.GB11197@blaptop> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> <51108DC8.4090704@cn.fujitsu.com> <20130205052517.GH2610@blaptop> <5110A442.5000707@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Lin Feng Return-path: Content-Disposition: inline In-Reply-To: <5110A442.5000707@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Tue, Feb 05, 2013 at 02:18:42PM +0800, Lin Feng wrote: > > > On 02/05/2013 01:25 PM, Minchan Kim wrote: > > Hi Lin, > > > > On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote: > >> Hi Minchan, > >> > >> On 02/05/2013 08:58 AM, Minchan Kim wrote: > >>> Hello, > >>> > >>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > >>>> Currently get_user_pages() always tries to allocate pages from movable zone, > >>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > >>>> of get_user_pages() is easy to pin user pages for a long time(for now we found > >>>> that pages pinned as aio ring pages is such case), which is fatal for memory > >>>> hotplug/remove framework. > >>>> > >>>> So the 1st patch introduces a new library function called > >>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > >>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from > >>>> non-movable zone via additional page migration. > >>>> > >>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by > >>>> get_user_pages() via using the new function. It only works when configed with > >>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > >>> > >>> CMA has same issue but the problem is the driver developers or any subsystem > >>> using GUP can't know their pages is in CMA area or not in advance. > >>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? > >>> Even some driver module in embedded side doesn't open their source code. > >> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users > >> of GUP, they will release the pinned pages immediately and to such users they should get > >> a good performance, using the old style interface is a smart way. And we had better just > >> deal with the cases we have to by using the new interface. > > > > Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages > > immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power > > by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know > > in advance they will use pinned pages long time or release in short time because it depends > > on some event like user's response which is very not predetermined. > > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of > > failing migration by page count and then, investigate they are really using GUP and it's > > REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? > > > > Yes. it could be done but it would be rather trobulesome job. > Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is > a big job. > If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the > *non_movable* version of GUP. > But in some server environment we want to keep the performance but also want to use hotremove > feature in case. Maybe patch the place as we need is a trade off for such support. > > Mel also said in the last discussion: > > On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote: > >> Well, that's a fairly low-level implementation detail. A more typical > >> approach would be to add a new get_user_pages_non_movable() or such. > >> That would probably have the same signature as get_user_pages(), with > >> one additional argument. Then get_user_pages() becomes a one-line > >> wrapper which passes in a particular value of that argument. > >> > > > > That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE > > allocations. That will impact THP availability by increasing the number > > of MIGRATE_UNMOVABLE blocks that exist and it would hit every user -- > > not just those that care about ZONE_MOVABLE. > > > > I'm likely to NAK such a patch if it's only about node hot-remove because > > it's much more of a corner case than wanting to use THP. > > > > I would prefer if get_user_pages() checked if the page it was about to > > pin was in ZONE_MOVABLE and if so, migrate it at that point before it's > > pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability > > if that's what they want. The CMA people might also want to take > > advantage of this if the page happened to be in the MIGRATE_CMA > > pageblock. > > > > So it may not a good idea that we all fall into calling the *non_movable* version of > GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think? Frankly speaking, I can't understand Mel's comment. AFAIUC, he said GUP checks the page before get_page and if the page is movable zone, then migrate it out of movable zone and get_page again. That's exactly what I want. It doesn't introduce GUP_NM. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Date: Tue, 05 Feb 2013 16:27:57 +0800 Message-ID: <5110C28D.1040001@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> <51108DC8.4090704@cn.fujitsu.com> <20130205052517.GH2610@blaptop> <5110A442.5000707@cn.fujitsu.com> <20130205074519.GB11197@blaptop> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Minchan Kim Return-path: In-Reply-To: <20130205074519.GB11197@blaptop> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Minchan, On 02/05/2013 03:45 PM, Minchan Kim wrote: >> So it may not a good idea that we all fall into calling the *non_movable* version of >> > GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think? > Frankly speaking, I can't understand Mel's comment. > AFAIUC, he said GUP checks the page before get_page and if the page is movable zone, > then migrate it out of movable zone and get_page again. > That's exactly what I want. It doesn't introduce GUP_NM. Since an long time pin or not is an unpredictable behave except you know what the caller wants to do. We have to check every time we call GUP, and GUP may need another parameter to teach itself to make the right decision? We have already got *8* parameters :( thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 5 Feb 2013 11:57:22 +0000 Message-ID: <20130205115722.GF21389@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: Content-Disposition: inline In-Reply-To: <20130204160624.5c20a8a0.akpm@linux-foundation.org> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote: > On Mon, 4 Feb 2013 18:04:07 +0800 > Lin Feng wrote: > > > get_user_pages() always tries to allocate pages from movable zone, which is not > > reliable to memory hotremove framework in some case. > > > > This patch introduces a new library function called get_user_pages_non_movable() > > to pin pages only from zone non-movable in memory. > > It's a wrapper of get_user_pages() but it makes sure that all pages come from > > non-movable zone via additional page migration. > > > > ... > > > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > > struct page **pages, struct vm_area_struct **vmas); > > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > > struct page **pages); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > > + unsigned long start, int nr_pages, int write, int force, > > + struct page **pages, struct vm_area_struct **vmas); > > +#endif > > The ifdefs aren't really needed here and I encourage people to omit > them. This keeps the header files looking neater and reduces the > chances of things later breaking because we forgot to update some > CONFIG_foo logic in a header file. The downside is that errors will be > revealed at link time rather than at compile time, but that's a pretty > small cost. > As an aside, if ifdefs *have* to be used then it often better to have a pattern like #ifdef CONFIG_MEMORY_HOTREMOVE int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int nr_pages, int write, int force, struct page **pages, struct vm_area_struct **vmas); #else static inline get_user_pages_non_movable(...) { get_user_pages(...) } #endif It eliminates the need for #ifdefs in the C file that calls get_user_pages_non_movable(). > > struct kvec; > > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > > struct page **pages); > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 73b64a3..5db811e 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > > return (idx == ZONE_NORMAL); > > } > > > > +static inline int is_movable(struct zone *zone) > > +{ > > + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > > +} > > A better name would be zone_is_movable(). He is matching the names of the is_normal(), is_dma32() and is_dma() helpers. Unfortunately I would expect a name like is_movable() to return true if the page had either been allocated with __GFP_MOVABLE or if it was checking migrate can currently handle the page. is_movable() is indeed a terrible name. They should be renamed or deleted in preparation -- see below. > We haven't been very > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > > And a neater implementation would be > > return zone_idx(zone) == ZONE_MOVABLE; > Yes. > All of which made me look at mmzone.h, and what I saw wasn't very nice :( > > Please review... > > > From: Andrew Morton > Subject: include/linux/mmzone.h: cleanups > > - implement zone_idx() in C to fix its references-args-twice macro bug > > - use zone_idx() in is_highmem() to remove large amounts of silly fluff. > > Cc: Lin Feng > Cc: Mel Gorman > Signed-off-by: Andrew Morton > --- > > include/linux/mmzone.h | 13 ++++++++----- > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h > --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups > +++ a/include/linux/mmzone.h > @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by > /* > * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. > */ > -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > +static inline enum zone_type zone_idx(struct zone *zone) > +{ > + return zone - zone->zone_pgdat->node_zones; > +} > > static inline int populated_zone(struct zone *zone) > { > @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon > static inline int is_highmem(struct zone *zone) > { > #ifdef CONFIG_HIGHMEM > - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || > - (zone_off == ZONE_MOVABLE * sizeof(*zone) && > - zone_movable_is_highmem()); > + enum zone_type idx = zone_idx(zone); > + > + return idx == ZONE_HIGHMEM || > + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); > #else > return 0; > #endif *blinks*. Ok, while I think your version looks nicer, it is reverting ddc81ed2 (remove sparse warning for mmzone.h). According to my version of gcc at least, your patch reintroduces the sar and sparse complains if run as make ARCH=i386 C=2 CF=-Wsparse-all .... this? From: Mel Gorman Subject: [PATCH] include/linux/mmzone.h: cleanups - implement zone_idx() in C to fix its references-args-twice macro bug - implement zone_is_idx in an attempt to explain what the fluff in is_highmem is for - rename is_highmem to zone_is_highmem as preparation for zone_is_movable() - implement zone_is_movable as a helper for places that care about ZONE_MOVABLE - Remove is_normal, is_dma32 and is_dma because apparently no one cares - convert users of zone_idx(zone) == ZONE_FOO to appropriate zone_is_foo() helper - Use is_highmem_idx() instead of is_highmem for the PageHighMem implementation. Should be functionally equivalent because we only care about the zones index, not exactly what zone it is [akpm@linux-foundation.org: zone_idx() reimplementation] Signed-off-by: Andrew Morton Signed-off-by: Mel Gorman --- arch/s390/mm/init.c | 2 +- arch/tile/mm/init.c | 5 +---- arch/x86/mm/highmem_32.c | 2 +- include/linux/mmzone.h | 49 +++++++++++++++----------------------------- include/linux/page-flags.h | 2 +- kernel/power/snapshot.c | 12 +++++------ mm/page_alloc.c | 6 +++--- 7 files changed, 30 insertions(+), 48 deletions(-) diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c index 81e596c..c26c321 100644 --- a/arch/s390/mm/init.c +++ b/arch/s390/mm/init.c @@ -231,7 +231,7 @@ int arch_add_memory(int nid, u64 start, u64 size) if (rc) return rc; for_each_zone(zone) { - if (zone_idx(zone) != ZONE_MOVABLE) { + if (!zone_is_movable(zone)) { /* Add range within existing zone limits */ zone_start_pfn = zone->zone_start_pfn; zone_end_pfn = zone->zone_start_pfn + diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c index ef29d6c..8287749 100644 --- a/arch/tile/mm/init.c +++ b/arch/tile/mm/init.c @@ -733,9 +733,6 @@ static void __init set_non_bootmem_pages_init(void) for_each_zone(z) { unsigned long start, end; int nid = z->zone_pgdat->node_id; -#ifdef CONFIG_HIGHMEM - int idx = zone_idx(z); -#endif start = z->zone_start_pfn; end = start + z->spanned_pages; @@ -743,7 +740,7 @@ static void __init set_non_bootmem_pages_init(void) start = max(start, max_low_pfn); #ifdef CONFIG_HIGHMEM - if (idx == ZONE_HIGHMEM) + if (zone_is_highmem(z)) totalhigh_pages += z->spanned_pages; #endif if (kdata_huge) { diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c index 6f31ee5..63020e7 100644 --- a/arch/x86/mm/highmem_32.c +++ b/arch/x86/mm/highmem_32.c @@ -124,7 +124,7 @@ void __init set_highmem_pages_init(void) for_each_zone(zone) { unsigned long zone_start_pfn, zone_end_pfn; - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) continue; zone_start_pfn = zone->zone_start_pfn; diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a23923b..5ecef75 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -779,10 +779,20 @@ static inline int local_memory_node(int node_id) { return node_id; }; unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long); #endif +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) +{ + /* This mess avoids a potentially expensive pointer subtraction. */ + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; + return zone_off == idx * sizeof(*zone); +} + /* * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. */ -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) +static inline enum zone_type zone_idx(struct zone *zone) +{ + return zone - zone->zone_pgdat->node_zones; +} static inline int populated_zone(struct zone *zone) { @@ -810,50 +820,25 @@ static inline int is_highmem_idx(enum zone_type idx) #endif } -static inline int is_normal_idx(enum zone_type idx) -{ - return (idx == ZONE_NORMAL); -} - /** - * is_highmem - helper function to quickly check if a struct zone is a + * zone_is_highmem - helper function to quickly check if a struct zone is a * highmem zone or not. This is an attempt to keep references * to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum. * @zone - pointer to struct zone variable */ -static inline int is_highmem(struct zone *zone) +static inline bool zone_is_highmem(struct zone *zone) { #ifdef CONFIG_HIGHMEM - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || - (zone_off == ZONE_MOVABLE * sizeof(*zone) && - zone_movable_is_highmem()); -#else - return 0; -#endif -} - -static inline int is_normal(struct zone *zone) -{ - return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL; -} - -static inline int is_dma32(struct zone *zone) -{ -#ifdef CONFIG_ZONE_DMA32 - return zone == zone->zone_pgdat->node_zones + ZONE_DMA32; + return zone_is_idx(zone, ZONE_HIGHMEM) || + (zone_is_idx(zone, ZONE_MOVABLE) && zone_movable_is_highmem()); #else return 0; #endif } -static inline int is_dma(struct zone *zone) +static inline bool zone_is_movable(struct zone *zone) { -#ifdef CONFIG_ZONE_DMA - return zone == zone->zone_pgdat->node_zones + ZONE_DMA; -#else - return 0; -#endif + return zone_is_idx(zone, ZONE_MOVABLE); } /* These two functions are used to setup the per zone pages min values */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index b5d1384..2f541f1 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -237,7 +237,7 @@ PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */ * Must use a macro here due to header dependency issues. page_zone() is not * available at this point. */ -#define PageHighMem(__p) is_highmem(page_zone(__p)) +#define PageHighMem(__p) is_highmem_idx(page_zonenum(__p)) #else PAGEFLAG_FALSE(HighMem) #endif diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 0de2857..cbf0da9 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -830,7 +830,7 @@ static unsigned int count_free_highmem_pages(void) unsigned int cnt = 0; for_each_populated_zone(zone) - if (is_highmem(zone)) + if (zone_is_highmem(zone)) cnt += zone_page_state(zone, NR_FREE_PAGES); return cnt; @@ -879,7 +879,7 @@ static unsigned int count_highmem_pages(void) for_each_populated_zone(zone) { unsigned long pfn, max_zone_pfn; - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) continue; mark_free_pages(zone); @@ -943,7 +943,7 @@ static unsigned int count_data_pages(void) unsigned int n = 0; for_each_populated_zone(zone) { - if (is_highmem(zone)) + if (zone_is_highmem(zone)) continue; mark_free_pages(zone); @@ -989,7 +989,7 @@ static void safe_copy_page(void *dst, struct page *s_page) static inline struct page * page_is_saveable(struct zone *zone, unsigned long pfn) { - return is_highmem(zone) ? + return zone_is_highmem(zone) ? saveable_highmem_page(zone, pfn) : saveable_page(zone, pfn); } @@ -1338,7 +1338,7 @@ int hibernate_preallocate_memory(void) size = 0; for_each_populated_zone(zone) { size += snapshot_additional_pages(zone); - if (is_highmem(zone)) + if (zone_is_highmem(zone)) highmem += zone_page_state(zone, NR_FREE_PAGES); else count += zone_page_state(zone, NR_FREE_PAGES); @@ -1481,7 +1481,7 @@ static int enough_free_mem(unsigned int nr_pages, unsigned int nr_highmem) unsigned int free = alloc_normal; for_each_populated_zone(zone) - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) free += zone_page_state(zone, NR_FREE_PAGES); nr_pages += count_pages_for_highmem(nr_highmem); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7e208f0..9558341 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5136,7 +5136,7 @@ static void __setup_per_zone_wmarks(void) /* Calculate total number of !ZONE_HIGHMEM pages */ for_each_zone(zone) { - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) lowmem_pages += zone->present_pages; } @@ -5146,7 +5146,7 @@ static void __setup_per_zone_wmarks(void) spin_lock_irqsave(&zone->lock, flags); tmp = (u64)pages_min * zone->present_pages; do_div(tmp, lowmem_pages); - if (is_highmem(zone)) { + if (zone_is_highmem(zone)) { /* * __GFP_HIGH and PF_MEMALLOC allocations usually don't * need highmem pages, so cap pages_min to a small @@ -5585,7 +5585,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count) * For avoiding noise data, lru_add_drain_all() should be called * If ZONE_MOVABLE, the zone never contains unmovable pages */ - if (zone_idx(zone) == ZONE_MOVABLE) + if (zone_is_movable(zone)) return false; mt = get_pageblock_migratetype(page); if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt)) > _ > > > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -58,6 +58,8 @@ > > #include > > #include > > #include > > +#include > > +#include > > #include > > > > #include > > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > > } > > EXPORT_SYMBOL(get_user_pages); > > > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > +/** > > + * It's a wrapper of get_user_pages() but it makes sure that all pages come from > > + * non-movable zone via additional page migration. > > + */ > > This needs a description of why the function exists - say something > about the requirements of memory hotplug. > > Also a few words describing how the function works would be good. > > > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > > + unsigned long start, int nr_pages, int write, int force, > > + struct page **pages, struct vm_area_struct **vmas) > > +{ > > + int ret, i, isolate_err, migrate_pre_flag; migrate_pre_flag should be bool and the name is a bit unusual. migrate_prepped? > > + LIST_HEAD(pagelist); > > + > > +retry: > > + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, > > + vmas); > > We should handle (ret < 0) here. At present the function will silently > convert an error return into "return 0", which is bad. The function > does appear to otherwise do the right thing if get_user_pages() failed, > but only due to good luck. > A BUG_ON check if we retry more than once wouldn't hurt either. It requires a broken implementation of alloc_migrate_target() but someone might "fix" it and miss this. A minor aside, it would be nice if we exited at this point if there was no populated ZONE_MOVABLE in the system. There is a movable_zone global variable already. If it was forced to be MAX_NR_ZONES before the call to find_usable_zone_for_movable() in memory initialisation we should be able to make a cheap check for it here. > > + isolate_err = 0; > > + migrate_pre_flag = 0; > > + > > + for (i = 0; i < ret; i++) { > > + if (is_movable(page_zone(pages[i]))) { This is a relatively expensive lookup. If my patch above is used then you can add a is_movable_idx helper and then this becomes if (is_movable_idx(page_zonenum(pages[i]))) which is cheaper. > > + if (!migrate_pre_flag) { > > + if (migrate_prep()) > > + goto put_page; CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never return failure. You could make this BUG_ON(migrate_prep()), remove this goto and the migrate_pre_flag check below becomes redundant. > > + migrate_pre_flag = 1; > > + } > > + > > + if (!isolate_lru_page(pages[i])) { > > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > > + page_is_file_cache(pages[i])); > > + list_add_tail(&pages[i]->lru, &pagelist); > > + } else { > > + isolate_err = 1; > > + goto put_page; > > + } isolate_lru_page() takes the LRU lock every time. If get_user_pages_non_movable is used heavily then you may encounter lock contention problems. Batching this lock would be a separate patch and should not be necessary yet but you should at least comment on it as a reminder. I think that a goto could also have been avoided here if you used break. The i == ret check below would be false and it would just fall through. Overall the flow would be a bit easier to follow. Why list_add_tail()? I don't think it's wrong but it's unusual to see list_add_tail() when list_add() is enough. > > + } > > + } > > + > > + /* All pages are non movable, we are done :) */ > > + if (i == ret && list_empty(&pagelist)) > > + return ret; > > + > > +put_page: > > + /* Undo the effects of former get_user_pages(), we won't pin anything */ > > + for (i = 0; i < ret; i++) > > + put_page(pages[i]); > > + release_pages. That comment is insufficient. There are non-obvious consequences to this logic. We are dropping pins on all pages regardless of what zone they are in. If the subsequent migration fails then we end up returning 0 with no pages pinned. The user-visible effect is that io_setup() fails for non-obvious reasons. It will return EAGAIN to userspace which will be interpreted as "The specified nr_events exceeds the user's limit of available events.". The application will either fail or potentially infinite loop if the developer interpreted EAGAIN as "try again" as opposed to "this is a permanent failure". Is that deliberate? Is it really preferable that AIO can fail to setup and the application exit just in case we want to hot-remove memory later? Should a failed migration generate a WARN_ON at least? I would think that it's better to WARN_ON_ONCE if migration fails but pin the pages as requested. If a future memory hot-remove operation fails then the warning will indicate why but applications will not fail as a result. It's a little clumsy but the memory hot-remove failure message could list what applications have pinned the pages that cannot be removed so the administrator has the option of force-killing the application. It is possible to discover what application is pinning a page from userspace but it would involve an expensive search with /proc/kpagemap > > + if (migrate_pre_flag && !isolate_err) { > > + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > > + false, MIGRATE_SYNC, MR_SYSCALL); The conversion of alloc_migrate_target is a bit problematic. It strips the __GFP_MOVABLE flag and the consequence of this is that it converts those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large number of pages, particularly if the number of get_user_pages_non_movable() increases for short-lived pins like direct IO. One way around this is to add a high_zoneidx parameter to __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask) as the high_zoneidx and create a new migration allocation function that passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE > > + /* Steal pages from non-movable zone successfully? */ > > + if (!ret) > > + goto retry; > > This is buggy. migrate_pages() doesn't empty its `from' argument, so > page_list must be reinitialised here (or, better, at the start of the loop). > page_list should be empty if ret == 0. > Mel, what's up with migrate_pages()? Shouldn't it be removing the > pages from the list when MIGRATEPAGE_SUCCESS? The use of > list_for_each_entry_safe() suggests we used to do that... > On successful migration, the pages are put back on the LRU at the end of unmap_and_move(). On migration failure the caller is responsible for putting the pages back on the LRU with putback_lru_pages() which happens below. > > + } > > + > > + putback_lru_pages(&pagelist); > > + return 0; > > +} > > +EXPORT_SYMBOL(get_user_pages_non_movable); > > +#endif > > + > -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 5 Feb 2013 13:32:44 +0000 Message-ID: <20130205133244.GH21389@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Andrew Morton Return-path: Content-Disposition: inline In-Reply-To: <20130205115722.GF21389@suse.de> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote: > > > > + migrate_pre_flag = 1; > > > + } > > > + > > > + if (!isolate_lru_page(pages[i])) { > > > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > > > + page_is_file_cache(pages[i])); > > > + list_add_tail(&pages[i]->lru, &pagelist); > > > + } else { > > > + isolate_err = 1; > > > + goto put_page; > > > + } > > isolate_lru_page() takes the LRU lock every time. Credit to Michal Hocko for bringing this up but with the number of other issues I missed that this is also broken with respect to huge page handling. hugetlbfs pages will not be on the LRU so the isolation will mess up and the migration has to be handled differently. Ordinarily hugetlbfs pages cannot be allocated from ZONE_MOVABLE but it is possible to configure it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this encounters a hugetlbfs page, it'll just blow up. The other is that this almost certainly broken for transhuge page handling. gup returns the head and tail pages and ordinarily this is ok because the caller only cares about the physical address. Migration will also split a hugepage if it receives it but you are potentially adding tail pages to a list here and then migrating them. The split of the first page will get very confused. I'm not exactly sure what the result will be but it won't be pretty. Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled during testing? -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andrew Morton Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 5 Feb 2013 13:13:04 -0800 Message-ID: <20130205131304.82a5c4cb.akpm@linux-foundation.org> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <511077E7.4040605@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Tang chen , Gu Zheng To: Lin Feng Return-path: In-Reply-To: <511077E7.4040605@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Tue, 05 Feb 2013 11:09:27 +0800 Lin Feng wrote: > > > >> struct kvec; > >> int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > >> struct page **pages); > >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > >> index 73b64a3..5db811e 100644 > >> --- a/include/linux/mmzone.h > >> +++ b/include/linux/mmzone.h > >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > >> return (idx == ZONE_NORMAL); > >> } > >> > >> +static inline int is_movable(struct zone *zone) > >> +{ > >> + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > >> +} > > > > A better name would be zone_is_movable(). We haven't been very > > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > > > Yes, zone_is_xxx() should be a better name, but there are some analogous > definition like is_dma32() and is_normal() etc, if we only use zone_is_movable() > for movable zone it will break such naming rules. > Should we update other ones in a separate patch later or just keep the old style? I do think the old names were poorly chosen. Yes, we could fix them up sometime but it's hardly a pressing issue. > > And a neater implementation would be > > > > return zone_idx(zone) == ZONE_MOVABLE; > > > OK. After your change, should we also update for other ones such as is_normal()..? Sure. From: Andrew Morton Subject: include-linux-mmzoneh-cleanups-fix use zone_idx() some more, further simplify is_highmem() Cc: Lin Feng Cc: Mel Gorman Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix include/linux/mmzone.h --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix +++ a/include/linux/mmzone.h @@ -859,25 +859,18 @@ static inline int is_normal_idx(enum zon */ static inline int is_highmem(struct zone *zone) { -#ifdef CONFIG_HIGHMEM - enum zone_type idx = zone_idx(zone); - - return idx == ZONE_HIGHMEM || - (idx == ZONE_MOVABLE && zone_movable_is_highmem()); -#else - return 0; -#endif + return is_highmem_idx(zone_idx(zone)); } static inline int is_normal(struct zone *zone) { - return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL; + return zone_idx(zone) == ZONE_NORMAL; } static inline int is_dma32(struct zone *zone) { #ifdef CONFIG_ZONE_DMA32 - return zone == zone->zone_pgdat->node_zones + ZONE_DMA32; + return zone_idx(zone) == ZONE_DMA32; #else return 0; #endif @@ -886,7 +879,7 @@ static inline int is_dma32(struct zone * static inline int is_dma(struct zone *zone) { #ifdef CONFIG_ZONE_DMA - return zone == zone->zone_pgdat->node_zones + ZONE_DMA; + return zone_idx(zone) == ZONE_DMA; #else return 0; #endif _ > >> + /* All pages are non movable, we are done :) */ > >> + if (i == ret && list_empty(&pagelist)) > >> + return ret; > >> + > >> +put_page: > >> + /* Undo the effects of former get_user_pages(), we won't pin anything */ > >> + for (i = 0; i < ret; i++) > >> + put_page(pages[i]); > >> + > >> + if (migrate_pre_flag && !isolate_err) { > >> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >> + false, MIGRATE_SYNC, MR_SYSCALL); > >> + /* Steal pages from non-movable zone successfully? */ > >> + if (!ret) > >> + goto retry; > > > > This is buggy. migrate_pages() doesn't empty its `from' argument, so > > page_list must be reinitialised here (or, better, at the start of the loop). > migrate_pages() > list_for_each_entry_safe() > unmap_and_move() > if (rc != -EAGAIN) { > list_del(&page->lru); > } ah, OK, there it is. -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michel Lespinasse Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 5 Feb 2013 18:26:51 -0800 Message-ID: References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Cc: Andrew Morton , Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Mel Gorman Return-path: In-Reply-To: <20130205115722.GF21389@suse.de> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Just nitpicking, but: On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman wrote: > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) > +{ > + /* This mess avoids a potentially expensive pointer subtraction. */ > + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > + return zone_off == idx * sizeof(*zone); > +} Maybe: return zone == zone->zone_pgdat->node_zones + idx; ? -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 6 Feb 2013 10:41:17 +0000 Message-ID: <20130206104117.GO21389@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Andrew Morton , Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Michel Lespinasse Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Tue, Feb 05, 2013 at 06:26:51PM -0800, Michel Lespinasse wrote: > Just nitpicking, but: > > On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman wrote: > > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) > > +{ > > + /* This mess avoids a potentially expensive pointer subtraction. */ > > + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > > + return zone_off == idx * sizeof(*zone); > > +} > > Maybe: > return zone == zone->zone_pgdat->node_zones + idx; > ? > Not a nit at all. Yours is more readable but it generates more code. A single line function that uses the helper generates 0x3f bytes of code (mostly function entry/exit) with your version and 0x39 bytes with mine. The difference in efficiency is marginal as your version uses lea to multiply by a constant but it's still slightly heavier. The old code is fractionally better, your version is more readable so it's up to Andrew really. Right now I think he's gone with his own version with zone_idx() in the name of readability whatever sparse has to say about the matter. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 18 Feb 2013 18:34:44 +0800 Message-ID: <512203C4.8010608@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Mel Gorman Return-path: In-Reply-To: <20130205115722.GF21389@suse.de> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Mel, See below. On 02/05/2013 07:57 PM, Mel Gorman wrote: > On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote: >> The ifdefs aren't really needed here and I encourage people to omit >> them. This keeps the header files looking neater and reduces the >> chances of things later breaking because we forgot to update some >> CONFIG_foo logic in a header file. The downside is that errors will be >> revealed at link time rather than at compile time, but that's a pretty >> small cost. >> > > As an aside, if ifdefs *have* to be used then it often better to have a > pattern like > > #ifdef CONFIG_MEMORY_HOTREMOVE > int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > unsigned long start, int nr_pages, int write, int force, > struct page **pages, struct vm_area_struct **vmas); > #else > static inline get_user_pages_non_movable(...) > { > get_user_pages(...) > } > #endif > > It eliminates the need for #ifdefs in the C file that calls > get_user_pages_non_movable(). Thanks for pointing out :) >>> + >>> +retry: >>> + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, >>> + vmas); >> >> We should handle (ret < 0) here. At present the function will silently >> convert an error return into "return 0", which is bad. The function >> does appear to otherwise do the right thing if get_user_pages() failed, >> but only due to good luck. >> > > A BUG_ON check if we retry more than once wouldn't hurt either. It requires > a broken implementation of alloc_migrate_target() but someone might "fix" > it and miss this. Agree. > > A minor aside, it would be nice if we exited at this point if there was > no populated ZONE_MOVABLE in the system. There is a movable_zone global > variable already. If it was forced to be MAX_NR_ZONES before the call to > find_usable_zone_for_movable() in memory initialisation we should be able > to make a cheap check for it here. OK. >>> + if (!migrate_pre_flag) { >>> + if (migrate_prep()) >>> + goto put_page; > > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never > return failure. You could make this BUG_ON(migrate_prep()), remove this > goto and the migrate_pre_flag check below becomes redundant. Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check will call/check migrate_prep() every time we isolate a single page, do we have to do so? I mean that for a group of pages it's sufficient to call migrate_prep() only once. There are some other migrate-relative codes using this scheme. So I get confused, do I miss something or the the performance cost is little? :( > >>> + migrate_pre_flag = 1; >>> + } >>> + >>> + if (!isolate_lru_page(pages[i])) { >>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >>> + page_is_file_cache(pages[i])); >>> + list_add_tail(&pages[i]->lru, &pagelist); >>> + } else { >>> + isolate_err = 1; >>> + goto put_page; >>> + } > > isolate_lru_page() takes the LRU lock every time. If > get_user_pages_non_movable is used heavily then you may encounter lock > contention problems. Batching this lock would be a separate patch and should > not be necessary yet but you should at least comment on it as a reminder. Farsighted, should improve it in the future. > > I think that a goto could also have been avoided here if you used break. The > i == ret check below would be false and it would just fall through. > Overall the flow would be a bit easier to follow. Yes, I noticed this before. I thought using goto could save some micro ops (here is only the i == ret check) though lower the readability but still be clearly. Also there are some other place in current kernel facing such performance/readability race condition, but it seems that the developers prefer readability, why? While I think performance is fatal to kernel.. > > Why list_add_tail()? I don't think it's wrong but it's unusual to see > list_add_tail() when list_add() is enough. Sorry, not intentional, just steal from other codes without thinking more ;-) > >>> + } >>> + } >>> + >>> + /* All pages are non movable, we are done :) */ >>> + if (i == ret && list_empty(&pagelist)) >>> + return ret; >>> + >>> +put_page: >>> + /* Undo the effects of former get_user_pages(), we won't pin anything */ >>> + for (i = 0; i < ret; i++) >>> + put_page(pages[i]); >>> + > > release_pages. > > That comment is insufficient. There are non-obvious consequences to this > logic. We are dropping pins on all pages regardless of what zone they > are in. If the subsequent migration fails then we end up returning 0 > with no pages pinned. The user-visible effect is that io_setup() fails > for non-obvious reasons. It will return EAGAIN to userspace which will be > interpreted as "The specified nr_events exceeds the user's limit of available > events.". The application will either fail or potentially infinite loop > if the developer interpreted EAGAIN as "try again" as opposed to "this is > a permanent failure". Yes, I missed such things. > > Is that deliberate? Is it really preferable that AIO can fail to setup > and the application exit just in case we want to hot-remove memory later? > Should a failed migration generate a WARN_ON at least? > > I would think that it's better to WARN_ON_ONCE if migration fails but pin > the pages as requested. If a future memory hot-remove operation fails > then the warning will indicate why but applications will not fail as a I'm so agree, I will keep the pin state alive and issue a WARN_ON_ONCE message in such case. > result. It's a little clumsy but the memory hot-remove failure message > could list what applications have pinned the pages that cannot be removed > so the administrator has the option of force-killing the application. It > is possible to discover what application is pinning a page from userspace > but it would involve an expensive search with /proc/kpagemap > >>> + if (migrate_pre_flag && !isolate_err) { >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >>> + false, MIGRATE_SYNC, MR_SYSCALL); > > The conversion of alloc_migrate_target is a bit problematic. It strips > the __GFP_MOVABLE flag and the consequence of this is that it converts > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > number of pages, particularly if the number of get_user_pages_non_movable() > increases for short-lived pins like direct IO. Sorry, I don't quite understand here neither. If we use the following new migration allocation function as you said, the increasing number of get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE pages. What's the difference, do I miss something? > > One way around this is to add a high_zoneidx parameter to > __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new > inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask) > as the high_zoneidx and create a new migration allocation function that > passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to > happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE On 02/05/2013 09:32 PM, Mel Gorman wrote: > Credit to Michal Hocko for bringing this up but with the number of > other issues I missed that this is also broken with respect to huge page > handling. hugetlbfs pages will not be on the LRU so the isolation will mess > up and the migration has to be handled differently. Ordinarily hugetlbfs > pages cannot be allocated from ZONE_MOVABLE but it is possible to configure > it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this > encounters a hugetlbfs page, it'll just blow up. > > The other is that this almost certainly broken for transhuge page > handling. gup returns the head and tail pages and ordinarily this is ok > because the caller only cares about the physical address. Migration will > also split a hugepage if it receives it but you are potentially adding > tail pages to a list here and then migrating them. The split of the first > page will get very confused. I'm not exactly sure what the result will be > but it won't be pretty. > > Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled > during testing? Sorry, I hadn't considered THP and hugepage yet, I will deal such cases based on your comments later. thanks a lot for your review :) linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 18 Feb 2013 15:17:16 +0000 Message-ID: <20130218151716.GL4365@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <512203C4.8010608@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Lin Feng Return-path: Content-Disposition: inline In-Reply-To: <512203C4.8010608@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, Feb 18, 2013 at 06:34:44PM +0800, Lin Feng wrote: > >>> + if (!migrate_pre_flag) { > >>> + if (migrate_prep()) > >>> + goto put_page; > > > > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never > > return failure. You could make this BUG_ON(migrate_prep()), remove this > > goto and the migrate_pre_flag check below becomes redundant. > Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check > will call/check migrate_prep() every time we isolate a single page, do we have > to do so? I was referring to the migrate_pre_flag check further down and I'm not suggesting it be removed from here as you do want to call migrate_prep() only once. However, as it'll never return failure for any kernel configuration that allows memory hot-remove, this goto can be removed and the flow simplified. > > > >>> + migrate_pre_flag = 1; > >>> + } > >>> + > >>> + if (!isolate_lru_page(pages[i])) { > >>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > >>> + page_is_file_cache(pages[i])); > >>> + list_add_tail(&pages[i]->lru, &pagelist); > >>> + } else { > >>> + isolate_err = 1; > >>> + goto put_page; > >>> + } > > > > isolate_lru_page() takes the LRU lock every time. If > > get_user_pages_non_movable is used heavily then you may encounter lock > > contention problems. Batching this lock would be a separate patch and should > > not be necessary yet but you should at least comment on it as a reminder. > Farsighted, should improve it in the future. > > > > > I think that a goto could also have been avoided here if you used break. The > > i == ret check below would be false and it would just fall through. > > Overall the flow would be a bit easier to follow. > Yes, I noticed this before. I thought using goto could save some micro ops > (here is only the i == ret check) though lower the readability but still be clearly. > > Also there are some other place in current kernel facing such performance/readability > race condition, but it seems that the developers prefer readability, why? While I > think performance is fatal to kernel.. > Memory hot-remove and page migration are not performance critical paths. For page migration, the cost will be likely dominated by the page copy but it's also possible the cost will be dominated by locking. The difference between a break and goto will not even be measurable. > > > > > > result. It's a little clumsy but the memory hot-remove failure message > > could list what applications have pinned the pages that cannot be removed > > so the administrator has the option of force-killing the application. It > > is possible to discover what application is pinning a page from userspace > > but it would involve an expensive search with /proc/kpagemap > > > >>> + if (migrate_pre_flag && !isolate_err) { > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >>> + false, MIGRATE_SYNC, MR_SYSCALL); > > > > The conversion of alloc_migrate_target is a bit problematic. It strips > > the __GFP_MOVABLE flag and the consequence of this is that it converts > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > > number of pages, particularly if the number of get_user_pages_non_movable() > > increases for short-lived pins like direct IO. > > Sorry, I don't quite understand here neither. If we use the following new > migration allocation function as you said, the increasing number of > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE > pages. What's the difference, do I miss something? > The replacement function preserves the __GFP_MOVABLE flag. It cannot use ZONE_MOVABLE but otherwise the newly allocated page will be grouped with other movable pages. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 19 Feb 2013 17:55:30 +0800 Message-ID: <51234C12.4020404@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <512203C4.8010608@cn.fujitsu.com> <20130218151716.GL4365@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Mel Gorman Return-path: In-Reply-To: <20130218151716.GL4365@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Mel, On 02/18/2013 11:17 PM, Mel Gorman wrote: >>> > > >>> > > >>> > > result. It's a little clumsy but the memory hot-remove failure message >>> > > could list what applications have pinned the pages that cannot be removed >>> > > so the administrator has the option of force-killing the application. It >>> > > is possible to discover what application is pinning a page from userspace >>> > > but it would involve an expensive search with /proc/kpagemap >>> > > >>>>> > >>> + if (migrate_pre_flag && !isolate_err) { >>>>> > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >>>>> > >>> + false, MIGRATE_SYNC, MR_SYSCALL); >>> > > >>> > > The conversion of alloc_migrate_target is a bit problematic. It strips >>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts >>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large >>> > > number of pages, particularly if the number of get_user_pages_non_movable() >>> > > increases for short-lived pins like direct IO. >> > >> > Sorry, I don't quite understand here neither. If we use the following new >> > migration allocation function as you said, the increasing number of >> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE >> > pages. What's the difference, do I miss something? >> > > The replacement function preserves the __GFP_MOVABLE flag. It cannot use > ZONE_MOVABLE but otherwise the newly allocated page will be grouped with > other movable pages. Ah, got it " But GFP_MOVABLE is not only a zone specifier but also an allocation policy.". Could I clear __GFP_HIGHMEM flag in alloc_migrate_target depending on private parameter so that we can keep MIGRATE_UNMOVABLE policy also allocate page none movable zones with little change? Does that approach work? Otherwise I have to follow your suggestion. thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 19 Feb 2013 10:34:59 +0000 Message-ID: <20130219103458.GN4365@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <512203C4.8010608@cn.fujitsu.com> <20130218151716.GL4365@suse.de> <51234C12.4020404@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Lin Feng Return-path: Content-Disposition: inline In-Reply-To: <51234C12.4020404@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Tue, Feb 19, 2013 at 05:55:30PM +0800, Lin Feng wrote: > Hi Mel, > > On 02/18/2013 11:17 PM, Mel Gorman wrote: > >>> > > > >>> > > > >>> > > result. It's a little clumsy but the memory hot-remove failure message > >>> > > could list what applications have pinned the pages that cannot be removed > >>> > > so the administrator has the option of force-killing the application. It > >>> > > is possible to discover what application is pinning a page from userspace > >>> > > but it would involve an expensive search with /proc/kpagemap > >>> > > > >>>>> > >>> + if (migrate_pre_flag && !isolate_err) { > >>>>> > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >>>>> > >>> + false, MIGRATE_SYNC, MR_SYSCALL); > >>> > > > >>> > > The conversion of alloc_migrate_target is a bit problematic. It strips > >>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts > >>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > >>> > > number of pages, particularly if the number of get_user_pages_non_movable() > >>> > > increases for short-lived pins like direct IO. > >> > > >> > Sorry, I don't quite understand here neither. If we use the following new > >> > migration allocation function as you said, the increasing number of > >> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE > >> > pages. What's the difference, do I miss something? > >> > > > The replacement function preserves the __GFP_MOVABLE flag. It cannot use > > ZONE_MOVABLE but otherwise the newly allocated page will be grouped with > > other movable pages. > > Ah, got it " But GFP_MOVABLE is not only a zone specifier but also an allocation policy.". > Update the comment and describe the exception then. > Could I clear __GFP_HIGHMEM flag in alloc_migrate_target depending on private parameter so > that we can keep MIGRATE_UNMOVABLE policy also allocate page none movable zones with little > change? > It should work (double check gfp_zone) but then the allocation cannot use the highmem zone. If you can be 100% certain that zone will not exist be populated then it will work as expected but it's a hack and should be commented clearly. You could do a BUILD_BUG_ON if CONFIG_HIGHMEM is set to enforce it. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 19 Feb 2013 21:37:55 +0800 Message-ID: <51238033.6010005@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Mel Gorman Return-path: In-Reply-To: <20130205133244.GH21389@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Mel, On 02/05/2013 09:32 PM, Mel Gorman wrote: > On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote: >> >>>> + migrate_pre_flag = 1; >>>> + } >>>> + >>>> + if (!isolate_lru_page(pages[i])) { >>>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >>>> + page_is_file_cache(pages[i])); >>>> + list_add_tail(&pages[i]->lru, &pagelist); >>>> + } else { >>>> + isolate_err = 1; >>>> + goto put_page; >>>> + } >> >> isolate_lru_page() takes the LRU lock every time. > > Credit to Michal Hocko for bringing this up but with the number of > other issues I missed that this is also broken with respect to huge page > handling. hugetlbfs pages will not be on the LRU so the isolation will mess > up and the migration has to be handled differently. Ordinarily hugetlbfs > pages cannot be allocated from ZONE_MOVABLE but it is possible to configure > it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this > encounters a hugetlbfs page, it'll just blow up. I look into the migrate_huge_page() codes find that if we support the hugetlbfs non movable migration, we have to invent another alloc_huge_page_node_nonmovable() or such allocate interface, which cost is large(exploding the codes and great impact on current alloc_huge_page_node()) but gains little, I think that pinning hugepage is a corner case. So can we skip over hugepage without migration but give some WARN_ON() info, is it acceptable? > > The other is that this almost certainly broken for transhuge page > handling. gup returns the head and tail pages and ordinarily this is ok I can't find codes doing such things :(, could you please point me out? > because the caller only cares about the physical address. Migration will > also split a hugepage if it receives it but you are potentially adding > tail pages to a list here and then migrating them. The split of the first > page will get very confused. I'm not exactly sure what the result will be > but it won't be pretty. > > Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled > during testing? I checked my config file that both CONFIG options aboved are enabled. However it was only be tested by two services invoking io_setup(), it works fine.. thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 20 Feb 2013 10:34:36 +0800 Message-ID: <5124363C.9060604@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51238033.6010005@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Mel Gorman Return-path: In-Reply-To: <51238033.6010005@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 02/19/2013 09:37 PM, Lin Feng wrote: >> > >> > The other is that this almost certainly broken for transhuge page >> > handling. gup returns the head and tail pages and ordinarily this is ok > I can't find codes doing such things :(, could you please point me out? > Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it. flee... thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wanpeng Li Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 20 Feb 2013 10:44:35 +0800 Message-ID: <19124.1707573228$1361328321@news.gmane.org> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51238033.6010005@cn.fujitsu.com> <5124363C.9060604@cn.fujitsu.com> Reply-To: Wanpeng Li Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Lin Feng Return-path: Content-Disposition: inline In-Reply-To: <5124363C.9060604@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Feb 20, 2013 at 10:34:36AM +0800, Lin Feng wrote: > > >On 02/19/2013 09:37 PM, Lin Feng wrote: >>> > >>> > The other is that this almost certainly broken for transhuge page >>> > handling. gup returns the head and tail pages and ordinarily this is ok >> I can't find codes doing such things :(, could you please point me out? >> >Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it. >flee... According to the compound page, the first page of compound page is called head page, other sub pages are called tail pages. Regards, Wanpeng Li > >thanks, >linfeng > >-- >To unsubscribe, send a message with 'unsubscribe linux-mm' in >the body to majordomo@kvack.org. For more info on Linux MM, >see: http://www.linux-mm.org/ . >Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wanpeng Li Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 20 Feb 2013 10:44:35 +0800 Message-ID: <28260.3998123252$1361328322@news.gmane.org> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51238033.6010005@cn.fujitsu.com> <5124363C.9060604@cn.fujitsu.com> Reply-To: Wanpeng Li Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Lin Feng Return-path: Content-Disposition: inline In-Reply-To: <5124363C.9060604@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Feb 20, 2013 at 10:34:36AM +0800, Lin Feng wrote: > > >On 02/19/2013 09:37 PM, Lin Feng wrote: >>> > >>> > The other is that this almost certainly broken for transhuge page >>> > handling. gup returns the head and tail pages and ordinarily this is ok >> I can't find codes doing such things :(, could you please point me out? >> >Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it. >flee... According to the compound page, the first page of compound page is called head page, other sub pages are called tail pages. Regards, Wanpeng Li > >thanks, >linfeng > >-- >To unsubscribe, send a message with 'unsubscribe linux-mm' in >the body to majordomo@kvack.org. For more info on Linux MM, >see: http://www.linux-mm.org/ . >Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 20 Feb 2013 10:59:03 +0800 Message-ID: <51243BF7.3030004@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51238033.6010005@cn.fujitsu.com> <5124363C.9060604@cn.fujitsu.com> <20130220024435.GA30208@hacker.(null)> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Wanpeng Li Return-path: In-Reply-To: <20130220024435.GA30208@hacker.(null)> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org Hi Wanpeng, On 02/20/2013 10:44 AM, Wanpeng Li wrote: >> Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it. >> >flee... > According to the compound page, the first page of compound page is > called head page, other sub pages are called tail pages. > > Regards, > Wanpeng Li > Yes, you are right, thanks for explaining. I thought it just means only the last one of the compound pages.. thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 From: Simon Jeons Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 20 Feb 2013 17:58:22 +0800 Message-ID: <51249E3E.9070909@gmail.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Cc: Andrew Morton , Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Mel Gorman Return-path: In-Reply-To: <20130205133244.GH21389@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 02/05/2013 09:32 PM, Mel Gorman wrote: > On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote: >>>> + migrate_pre_flag = 1; >>>> + } >>>> + >>>> + if (!isolate_lru_page(pages[i])) { >>>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >>>> + page_is_file_cache(pages[i])); >>>> + list_add_tail(&pages[i]->lru, &pagelist); >>>> + } else { >>>> + isolate_err = 1; >>>> + goto put_page; >>>> + } >> isolate_lru_page() takes the LRU lock every time. > Credit to Michal Hocko for bringing this up but with the number of > other issues I missed that this is also broken with respect to huge page > handling. hugetlbfs pages will not be on the LRU so the isolation will mess > up and the migration has to be handled differently. Ordinarily hugetlbfs > pages cannot be allocated from ZONE_MOVABLE but it is possible to configure > it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this > encounters a hugetlbfs page, it'll just blow up. As you said, hugetlbfs pages are not in LRU list, then how can encounter a hugetlbfs page and blow up? > > The other is that this almost certainly broken for transhuge page > handling. gup returns the head and tail pages and ordinarily this is ok When need gup thp? in kvm case? > because the caller only cares about the physical address. Migration will > also split a hugepage if it receives it but you are potentially adding > tail pages to a list here and then migrating them. The split of the first > page will get very confused. I'm not exactly sure what the result will be > but it won't be pretty. > > Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled > during testing? > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 20 Feb 2013 18:23:39 +0800 Message-ID: <5124A42B.1020905@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51249E3E.9070909@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Cc: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Simon Jeons Return-path: In-Reply-To: <51249E3E.9070909@gmail.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Simon, On 02/20/2013 05:58 PM, Simon Jeons wrote: > >> >> The other is that this almost certainly broken for transhuge page >> handling. gup returns the head and tail pages and ordinarily this is ok > > When need gup thp? in kvm case? gup just pins the wanted pages(for x86 is 4k size) of user address space in memory. We can't expect the pages have been allocated for user address space are thp or normal page. So we need to deal with them and I think it have nothing to do with kvm. thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Simon Jeons Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 20 Feb 2013 19:31:54 +0800 Message-ID: <5124B42A.1020908@gmail.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51249E3E.9070909@gmail.com> <5124A42B.1020905@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Cc: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Lin Feng Return-path: In-Reply-To: <5124A42B.1020905@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 02/20/2013 06:23 PM, Lin Feng wrote: > Hi Simon, > > On 02/20/2013 05:58 PM, Simon Jeons wrote: >>> The other is that this almost certainly broken for transhuge page >>> handling. gup returns the head and tail pages and ordinarily this is ok >> When need gup thp? in kvm case? > gup just pins the wanted pages(for x86 is 4k size) of user address space in memory. > We can't expect the pages have been allocated for user address space are thp or > normal page. So we need to deal with them and I think it have nothing to do with kvm. Ok, I'm curious about userspace process call which funtion(will call gup) to pin pages except make_pages_present? > > thanks, > linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 20 Feb 2013 19:54:57 +0800 Message-ID: <5124B991.1020302@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51249E3E.9070909@gmail.com> <5124A42B.1020905@cn.fujitsu.com> <5124B42A.1020908@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Cc: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Simon Jeons Return-path: In-Reply-To: <5124B42A.1020908@gmail.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 02/20/2013 07:31 PM, Simon Jeons wrote: > On 02/20/2013 06:23 PM, Lin Feng wrote: >> Hi Simon, >> >> On 02/20/2013 05:58 PM, Simon Jeons wrote: >>>> The other is that this almost certainly broken for transhuge page >>>> handling. gup returns the head and tail pages and ordinarily this is ok >>> When need gup thp? in kvm case? >> gup just pins the wanted pages(for x86 is 4k size) of user address space in memory. >> We can't expect the pages have been allocated for user address space are thp or >> normal page. So we need to deal with them and I think it have nothing to do with kvm. > > Ok, I'm curious about userspace process call which funtion(will call gup) to pin pages except make_pages_present? No, userspace process doesn't pin any pages directly but through some syscalls like io_setup() indirectly for other purpose because kernel can't pagefault and it have to keep the page alive. Kernel wants to communicate with the userspace such as to notify some events so it need some sort of buffer that both Kernel and User space can both access, which leads to so called pin pages by gup. thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Lin Feng Subject: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 4 Feb 2013 18:04:07 +0800 Message-Id: <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> In-Reply-To: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng get_user_pages() always tries to allocate pages from movable zone, which is not reliable to memory hotremove framework in some case. This patch introduces a new library function called get_user_pages_non_movable() to pin pages only from zone non-movable in memory. It's a wrapper of get_user_pages() but it makes sure that all pages come from non-movable zone via additional page migration. Cc: Andrew Morton Cc: Mel Gorman Cc: KAMEZAWA Hiroyuki Cc: Yasuaki Ishimatsu Reviewed-by: Tang Chen Reviewed-by: Gu Zheng Signed-off-by: Lin Feng --- include/linux/mm.h | 5 ++++ include/linux/mmzone.h | 4 ++++ mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++ mm/page_isolation.c | 5 ++++ 4 files changed, 77 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 66e2f7c..2a25d0e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, struct page **pages, struct vm_area_struct **vmas); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); +#ifdef CONFIG_MEMORY_HOTREMOVE +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, int force, + struct page **pages, struct vm_area_struct **vmas); +#endif struct kvec; int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, struct page **pages); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 73b64a3..5db811e 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) return (idx == ZONE_NORMAL); } +static inline int is_movable(struct zone *zone) +{ + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; +} /** * is_highmem - helper function to quickly check if a struct zone is a * highmem zone or not. This is an attempt to keep references diff --git a/mm/memory.c b/mm/memory.c index bb1369f..e3b8e19 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -58,6 +58,8 @@ #include #include #include +#include +#include #include #include @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, } EXPORT_SYMBOL(get_user_pages); +#ifdef CONFIG_MEMORY_HOTREMOVE +/** + * It's a wrapper of get_user_pages() but it makes sure that all pages come from + * non-movable zone via additional page migration. + */ +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, int force, + struct page **pages, struct vm_area_struct **vmas) +{ + int ret, i, isolate_err, migrate_pre_flag; + LIST_HEAD(pagelist); + +retry: + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, + vmas); + + isolate_err = 0; + migrate_pre_flag = 0; + + for (i = 0; i < ret; i++) { + if (is_movable(page_zone(pages[i]))) { + if (!migrate_pre_flag) { + if (migrate_prep()) + goto put_page; + migrate_pre_flag = 1; + } + + if (!isolate_lru_page(pages[i])) { + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + + page_is_file_cache(pages[i])); + list_add_tail(&pages[i]->lru, &pagelist); + } else { + isolate_err = 1; + goto put_page; + } + } + } + + /* All pages are non movable, we are done :) */ + if (i == ret && list_empty(&pagelist)) + return ret; + +put_page: + /* Undo the effects of former get_user_pages(), we won't pin anything */ + for (i = 0; i < ret; i++) + put_page(pages[i]); + + if (migrate_pre_flag && !isolate_err) { + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, + false, MIGRATE_SYNC, MR_SYSCALL); + /* Steal pages from non-movable zone successfully? */ + if (!ret) + goto retry; + } + + putback_lru_pages(&pagelist); + return 0; +} +EXPORT_SYMBOL(get_user_pages_non_movable); +#endif + /** * get_dump_page() - pin user page in memory while writing it to core dump * @addr: user address diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 383bdbb..1b7bd17 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -247,6 +247,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, return ret ? 0 : -EBUSY; } +/** + * @private: 0 means page can be alloced from movable zone, otherwise forbidden + */ struct page *alloc_migrate_target(struct page *page, unsigned long private, int **resultp) { @@ -254,6 +257,8 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private, if (PageHighMem(page)) gfp_mask |= __GFP_HIGHMEM; + if (unlikely(private != 0)) + gfp_mask &= ~__GFP_MOVABLE; return alloc_page(gfp_mask); } -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Lin Feng Subject: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Date: Mon, 4 Feb 2013 18:04:08 +0800 Message-Id: <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> In-Reply-To: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng This patch gets around the aio ring pages can't be migrated bug caused by get_user_pages() via using the new function. It only works as configed with CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). Cc: Benjamin LaHaise Cc: Alexander Viro Cc: Andrew Morton Reviewed-by: Tang Chen Reviewed-by: Gu Zheng Signed-off-by: Lin Feng --- fs/aio.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/aio.c b/fs/aio.c index 71f613c..0e9b30a 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) } dprintk("mmap address: 0x%08lx\n", info->mmap_base); +#ifdef CONFIG_MEMORY_HOTREMOVE + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, + info->mmap_base, nr_pages, + 1, 0, info->ring_pages, NULL); +#else info->nr_pages = get_user_pages(current, ctx->mm, info->mmap_base, nr_pages, 1, 0, info->ring_pages, NULL); +#endif up_write(&ctx->mm->mmap_sem); if (unlikely(info->nr_pages != nr_pages)) { -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Jeff Moyer Subject: Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> Date: Mon, 04 Feb 2013 10:18:03 -0500 In-Reply-To: <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> (Lin Feng's message of "Mon, 4 Feb 2013 18:04:08 +0800") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: Lin Feng Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Lin Feng writes: > This patch gets around the aio ring pages can't be migrated bug caused by > get_user_pages() via using the new function. It only works as configed with > CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > > Cc: Benjamin LaHaise > Cc: Alexander Viro > Cc: Andrew Morton > Reviewed-by: Tang Chen > Reviewed-by: Gu Zheng > Signed-off-by: Lin Feng > --- > fs/aio.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/fs/aio.c b/fs/aio.c > index 71f613c..0e9b30a 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) > } > > dprintk("mmap address: 0x%08lx\n", info->mmap_base); > +#ifdef CONFIG_MEMORY_HOTREMOVE > + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, > + info->mmap_base, nr_pages, > + 1, 0, info->ring_pages, NULL); > +#else > info->nr_pages = get_user_pages(current, ctx->mm, > info->mmap_base, nr_pages, > 1, 0, info->ring_pages, NULL); > +#endif Can't you hide this in your 1/1 patch, by providing this function as just a static inline wrapper around get_user_pages when CONFIG_MEMORY_HOTREMOVE is not enabled? Cheers, Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 4 Feb 2013 15:02:09 -0800 From: Zach Brown Subject: Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Message-ID: <20130204230209.GK14246@lenny.home.zabbo.net> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Jeff Moyer Cc: Lin Feng , akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org > > index 71f613c..0e9b30a 100644 > > --- a/fs/aio.c > > +++ b/fs/aio.c > > @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) > > } > > > > dprintk("mmap address: 0x%08lx\n", info->mmap_base); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, > > + info->mmap_base, nr_pages, > > + 1, 0, info->ring_pages, NULL); > > +#else > > info->nr_pages = get_user_pages(current, ctx->mm, > > info->mmap_base, nr_pages, > > 1, 0, info->ring_pages, NULL); > > +#endif > > Can't you hide this in your 1/1 patch, by providing this function as > just a static inline wrapper around get_user_pages when > CONFIG_MEMORY_HOTREMOVE is not enabled? Yes, please. Having callers duplicate the call site for a single optional boolean input is unacceptable. But do we want another input argument as a name? Should aio have been using get_user_pages_fast()? (and so now _fast_non_movable?) I wonder if it's time to offer the booleans as a _flags() variant, much like the current internal flags for __get_user_pages(). The write and force arguments are already booleans, we have a different fast api, and now we're adding non-movable. The NON_MOVABLE flag would be 0 without MEMORY_HOTREMOVE, easy peasy. Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might also be nice :). No? - z -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 4 Feb 2013 16:06:24 -0800 From: Andrew Morton Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-Id: <20130204160624.5c20a8a0.akpm@linux-foundation.org> In-Reply-To: <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Lin Feng Cc: mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org melreadthis On Mon, 4 Feb 2013 18:04:07 +0800 Lin Feng wrote: > get_user_pages() always tries to allocate pages from movable zone, which is not > reliable to memory hotremove framework in some case. > > This patch introduces a new library function called get_user_pages_non_movable() > to pin pages only from zone non-movable in memory. > It's a wrapper of get_user_pages() but it makes sure that all pages come from > non-movable zone via additional page migration. > > ... > > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > struct page **pages, struct vm_area_struct **vmas); > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > struct page **pages); > +#ifdef CONFIG_MEMORY_HOTREMOVE > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, int nr_pages, int write, int force, > + struct page **pages, struct vm_area_struct **vmas); > +#endif The ifdefs aren't really needed here and I encourage people to omit them. This keeps the header files looking neater and reduces the chances of things later breaking because we forgot to update some CONFIG_foo logic in a header file. The downside is that errors will be revealed at link time rather than at compile time, but that's a pretty small cost. > struct kvec; > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > struct page **pages); > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 73b64a3..5db811e 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > return (idx == ZONE_NORMAL); > } > > +static inline int is_movable(struct zone *zone) > +{ > + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > +} A better name would be zone_is_movable(). We haven't been very consistent about this in mmzone.h, but zone_is_foo() is pretty common. And a neater implementation would be return zone_idx(zone) == ZONE_MOVABLE; All of which made me look at mmzone.h, and what I saw wasn't very nice :( Please review... From: Andrew Morton Subject: include/linux/mmzone.h: cleanups - implement zone_idx() in C to fix its references-args-twice macro bug - use zone_idx() in is_highmem() to remove large amounts of silly fluff. Cc: Lin Feng Cc: Mel Gorman Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups +++ a/include/linux/mmzone.h @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by /* * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. */ -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) +static inline enum zone_type zone_idx(struct zone *zone) +{ + return zone - zone->zone_pgdat->node_zones; +} static inline int populated_zone(struct zone *zone) { @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon static inline int is_highmem(struct zone *zone) { #ifdef CONFIG_HIGHMEM - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || - (zone_off == ZONE_MOVABLE * sizeof(*zone) && - zone_movable_is_highmem()); + enum zone_type idx = zone_idx(zone); + + return idx == ZONE_HIGHMEM || + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); #else return 0; #endif _ > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -58,6 +58,8 @@ > #include > #include > #include > +#include > +#include > #include > > #include > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > } > EXPORT_SYMBOL(get_user_pages); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > +/** > + * It's a wrapper of get_user_pages() but it makes sure that all pages come from > + * non-movable zone via additional page migration. > + */ This needs a description of why the function exists - say something about the requirements of memory hotplug. Also a few words describing how the function works would be good. > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, int nr_pages, int write, int force, > + struct page **pages, struct vm_area_struct **vmas) > +{ > + int ret, i, isolate_err, migrate_pre_flag; > + LIST_HEAD(pagelist); > + > +retry: > + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, > + vmas); We should handle (ret < 0) here. At present the function will silently convert an error return into "return 0", which is bad. The function does appear to otherwise do the right thing if get_user_pages() failed, but only due to good luck. > + isolate_err = 0; > + migrate_pre_flag = 0; > + > + for (i = 0; i < ret; i++) { > + if (is_movable(page_zone(pages[i]))) { > + if (!migrate_pre_flag) { > + if (migrate_prep()) > + goto put_page; > + migrate_pre_flag = 1; > + } > + > + if (!isolate_lru_page(pages[i])) { > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > + page_is_file_cache(pages[i])); > + list_add_tail(&pages[i]->lru, &pagelist); > + } else { > + isolate_err = 1; > + goto put_page; > + } > + } > + } > + > + /* All pages are non movable, we are done :) */ > + if (i == ret && list_empty(&pagelist)) > + return ret; > + > +put_page: > + /* Undo the effects of former get_user_pages(), we won't pin anything */ > + for (i = 0; i < ret; i++) > + put_page(pages[i]); > + > + if (migrate_pre_flag && !isolate_err) { > + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > + false, MIGRATE_SYNC, MR_SYSCALL); > + /* Steal pages from non-movable zone successfully? */ > + if (!ret) > + goto retry; This is buggy. migrate_pages() doesn't empty its `from' argument, so page_list must be reinitialised here (or, better, at the start of the loop). Mel, what's up with migrate_pages()? Shouldn't it be removing the pages from the list when MIGRATEPAGE_SUCCESS? The use of list_for_each_entry_safe() suggests we used to do that... > + } > + > + putback_lru_pages(&pagelist); > + return 0; > +} > +EXPORT_SYMBOL(get_user_pages_non_movable); > +#endif > + Generally, I'd like Mel to go through (the next version of) this patch carefully, please. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 5 Feb 2013 09:58:59 +0900 From: Minchan Kim Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Message-ID: <20130205005859.GE2610@blaptop> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Lin Feng Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Hello, On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > Currently get_user_pages() always tries to allocate pages from movable zone, > as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > of get_user_pages() is easy to pin user pages for a long time(for now we found > that pages pinned as aio ring pages is such case), which is fatal for memory > hotplug/remove framework. > > So the 1st patch introduces a new library function called > get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > It's a wrapper of get_user_pages() but it makes sure that all pages come from > non-movable zone via additional page migration. > > The 2nd patch gets around the aio ring pages can't be migrated bug caused by > get_user_pages() via using the new function. It only works when configed with > CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). CMA has same issue but the problem is the driver developers or any subsystem using GUP can't know their pages is in CMA area or not in advance. So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? Even some driver module in embedded side doesn't open their source code. I would like to make GUP smart so it allocates a page from non-movable/non-cma area when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :( > > Lin Feng (2): > mm: hotplug: implement non-movable version of get_user_pages() > fs/aio.c: use non-movable version of get_user_pages() to pin ring > pages when support memory hotremove > > fs/aio.c | 6 +++++ > include/linux/mm.h | 5 ++++ > include/linux/mmzone.h | 4 ++++ > mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++ > mm/page_isolation.c | 5 ++++ > 5 files changed, 83 insertions(+) > > -- > 1.7.11.7 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <511077E7.4040605@cn.fujitsu.com> Date: Tue, 05 Feb 2013 11:09:27 +0800 From: Lin Feng MIME-Version: 1.0 Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> In-Reply-To: <20130204160624.5c20a8a0.akpm@linux-foundation.org> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Tang chen , Gu Zheng Hi Andrew, On 02/05/2013 08:06 AM, Andrew Morton wrote: > > melreadthis > > On Mon, 4 Feb 2013 18:04:07 +0800 > Lin Feng wrote: > >> get_user_pages() always tries to allocate pages from movable zone, which is not >> reliable to memory hotremove framework in some case. >> >> This patch introduces a new library function called get_user_pages_non_movable() >> to pin pages only from zone non-movable in memory. >> It's a wrapper of get_user_pages() but it makes sure that all pages come from >> non-movable zone via additional page migration. >> >> ... >> >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, >> struct page **pages, struct vm_area_struct **vmas); >> int get_user_pages_fast(unsigned long start, int nr_pages, int write, >> struct page **pages); >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, >> + unsigned long start, int nr_pages, int write, int force, >> + struct page **pages, struct vm_area_struct **vmas); >> +#endif > > The ifdefs aren't really needed here and I encourage people to omit > them. This keeps the header files looking neater and reduces the > chances of things later breaking because we forgot to update some > CONFIG_foo logic in a header file. The downside is that errors will be > revealed at link time rather than at compile time, but that's a pretty > small cost. OK, got it, thanks :) > >> struct kvec; >> int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, >> struct page **pages); >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >> index 73b64a3..5db811e 100644 >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) >> return (idx == ZONE_NORMAL); >> } >> >> +static inline int is_movable(struct zone *zone) >> +{ >> + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; >> +} > > A better name would be zone_is_movable(). We haven't been very > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > Yes, zone_is_xxx() should be a better name, but there are some analogous definition like is_dma32() and is_normal() etc, if we only use zone_is_movable() for movable zone it will break such naming rules. Should we update other ones in a separate patch later or just keep the old style? > And a neater implementation would be > > return zone_idx(zone) == ZONE_MOVABLE; > OK. After your change, should we also update for other ones such as is_normal()..? > All of which made me look at mmzone.h, and what I saw wasn't very nice :( > > Please review... > > > From: Andrew Morton > Subject: include/linux/mmzone.h: cleanups > > - implement zone_idx() in C to fix its references-args-twice macro bug > > - use zone_idx() in is_highmem() to remove large amounts of silly fluff. > > Cc: Lin Feng > Cc: Mel Gorman > Signed-off-by: Andrew Morton > --- > > include/linux/mmzone.h | 13 ++++++++----- > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h > --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups > +++ a/include/linux/mmzone.h > @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by > /* > * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. > */ > -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > +static inline enum zone_type zone_idx(struct zone *zone) > +{ > + return zone - zone->zone_pgdat->node_zones; > +} > > static inline int populated_zone(struct zone *zone) > { > @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon > static inline int is_highmem(struct zone *zone) > { > #ifdef CONFIG_HIGHMEM > - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || > - (zone_off == ZONE_MOVABLE * sizeof(*zone) && > - zone_movable_is_highmem()); > + enum zone_type idx = zone_idx(zone); > + > + return idx == ZONE_HIGHMEM || > + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); > #else > return 0; > #endif > _ > > >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -58,6 +58,8 @@ >> #include >> #include >> #include >> +#include >> +#include >> #include >> >> #include >> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, >> } >> EXPORT_SYMBOL(get_user_pages); >> >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> +/** >> + * It's a wrapper of get_user_pages() but it makes sure that all pages come from >> + * non-movable zone via additional page migration. >> + */ > > This needs a description of why the function exists - say something > about the requirements of memory hotplug. > > Also a few words describing how the function works would be good. OK, I will add them in next version. > >> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, >> + unsigned long start, int nr_pages, int write, int force, >> + struct page **pages, struct vm_area_struct **vmas) >> +{ >> + int ret, i, isolate_err, migrate_pre_flag; >> + LIST_HEAD(pagelist); >> + >> +retry: >> + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, >> + vmas); > > We should handle (ret < 0) here. At present the function will silently > convert an error return into "return 0", which is bad. The function > does appear to otherwise do the right thing if get_user_pages() failed, > but only due to good luck. Sorry, I forgot this, thanks. > >> + isolate_err = 0; >> + migrate_pre_flag = 0; >> + >> + for (i = 0; i < ret; i++) { >> + if (is_movable(page_zone(pages[i]))) { >> + if (!migrate_pre_flag) { >> + if (migrate_prep()) >> + goto put_page; >> + migrate_pre_flag = 1; >> + } >> + >> + if (!isolate_lru_page(pages[i])) { >> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >> + page_is_file_cache(pages[i])); >> + list_add_tail(&pages[i]->lru, &pagelist); >> + } else { >> + isolate_err = 1; >> + goto put_page; >> + } >> + } >> + } >> + >> + /* All pages are non movable, we are done :) */ >> + if (i == ret && list_empty(&pagelist)) >> + return ret; >> + >> +put_page: >> + /* Undo the effects of former get_user_pages(), we won't pin anything */ >> + for (i = 0; i < ret; i++) >> + put_page(pages[i]); >> + >> + if (migrate_pre_flag && !isolate_err) { >> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >> + false, MIGRATE_SYNC, MR_SYSCALL); >> + /* Steal pages from non-movable zone successfully? */ >> + if (!ret) >> + goto retry; > > This is buggy. migrate_pages() doesn't empty its `from' argument, so > page_list must be reinitialised here (or, better, at the start of the loop). migrate_pages() list_for_each_entry_safe() unmap_and_move() if (rc != -EAGAIN) { list_del(&page->lru); } IUUC, the page migrated successfully will be remove from the `from` list here. So if all pages having been migrated successlly, the list will be empty, and we goto retry. Otherwise the rest of the immigrated pages will be handled later by putback_lru_pages(&pagelist), and the function just return. Also there are many places using such logic: 1274 nr_failed = migrate_pages(&pagelist, new_vma_page, 1275 (unsigned long)vma, 1276 false, MIGRATE_SYNC, 1277 MR_MEMPOLICY_MBIND); 1278 if (nr_failed) 1279 putback_lru_pages(&pagelist); Since we traverse the list and handle each page separately, It's likely that we have migrated some page successfully before we encounter a failure. If the pagelist is still carrying the successfully migrated pages when migrate_pages() return, such code is bugyy. > > Mel, what's up with migrate_pages()? Shouldn't it be removing the > pages from the list when MIGRATEPAGE_SUCCESS? The use of > list_for_each_entry_safe() suggests we used to do that... > >> + } >> + >> + putback_lru_pages(&pagelist); >> + return 0; >> +} >> +EXPORT_SYMBOL(get_user_pages_non_movable); >> +#endif >> + > > Generally, I'd like Mel to go through (the next version of) this patch > carefully, please. > > On 02/05/2013 08:18 AM, Andrew Morton wrote:> On Mon, 4 Feb 2013 16:06:24 -0800 > Andrew Morton wrote: > >>> > > +put_page: >>> > > + /* Undo the effects of former get_user_pages(), we won't pin anything */ >>> > > + for (i = 0; i < ret; i++) >>> > > + put_page(pages[i]); > We can use release_pages() here. > > release_pages() is designed to be more efficient when we're putting the > final reference to (most of) the pages. It probably has little if any > benefit when putting still-in-use pages, as we're doing here. > > But please consider... OK, I will try :) > thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 5 Feb 2013 14:25:17 +0900 From: Minchan Kim Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Message-ID: <20130205052517.GH2610@blaptop> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> <51108DC8.4090704@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51108DC8.4090704@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Lin Feng Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Hi Lin, On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote: > Hi Minchan, > > On 02/05/2013 08:58 AM, Minchan Kim wrote: > > Hello, > > > > On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > >> Currently get_user_pages() always tries to allocate pages from movable zone, > >> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > >> of get_user_pages() is easy to pin user pages for a long time(for now we found > >> that pages pinned as aio ring pages is such case), which is fatal for memory > >> hotplug/remove framework. > >> > >> So the 1st patch introduces a new library function called > >> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > >> It's a wrapper of get_user_pages() but it makes sure that all pages come from > >> non-movable zone via additional page migration. > >> > >> The 2nd patch gets around the aio ring pages can't be migrated bug caused by > >> get_user_pages() via using the new function. It only works when configed with > >> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > > > > CMA has same issue but the problem is the driver developers or any subsystem > > using GUP can't know their pages is in CMA area or not in advance. > > So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? > > Even some driver module in embedded side doesn't open their source code. > Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users > of GUP, they will release the pinned pages immediately and to such users they should get > a good performance, using the old style interface is a smart way. And we had better just > deal with the cases we have to by using the new interface. Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know in advance they will use pinned pages long time or release in short time because it depends on some event like user's response which is very not predetermined. So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of failing migration by page count and then, investigate they are really using GUP and it's REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? Yes. it could be done but it would be rather trobulesome job. > > > > > I would like to make GUP smart so it allocates a page from non-movable/non-cma area > > when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP > > performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :( > As I debuged the get_user_pages(), I found that some pages is already there and may be > allocated before we call get_user_pages(). __get_user_pages() have following logic to > handle such case. > 1786 while (!(page = follow_page(vma, start, foll_flags))) { > 1787 int ret; > To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP > as smart as we want :( , so I worked out the migration approach to get around and > avoid messing up the current code :) I didn't look at your code in detail yet but What's wrong following this? Change curret GUP with old_GUP. #ifdef CONFIG_MIGRATE_ISOLATE int get_user_pages_non_movable() { .. old_get_user_pages() .. } int get_user_pages() { return get_user_pages_non_movable(); } #else int get_user_pages() { return old_get_user_pages() } #endif > > thanks, > linfeng > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <51109A38.7050605@cn.fujitsu.com> Date: Tue, 05 Feb 2013 13:35:52 +0800 From: Lin Feng MIME-Version: 1.0 Subject: Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> <20130204230209.GK14246@lenny.home.zabbo.net> In-Reply-To: <20130204230209.GK14246@lenny.home.zabbo.net> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Zach Brown Cc: Jeff Moyer , akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Tang chen , Gu Zheng Hi Zach, On 02/05/2013 07:02 AM, Zach Brown wrote: >>> index 71f613c..0e9b30a 100644 >>> --- a/fs/aio.c >>> +++ b/fs/aio.c >>> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) >>> } >>> >>> dprintk("mmap address: 0x%08lx\n", info->mmap_base); >>> +#ifdef CONFIG_MEMORY_HOTREMOVE >>> + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, >>> + info->mmap_base, nr_pages, >>> + 1, 0, info->ring_pages, NULL); >>> +#else >>> info->nr_pages = get_user_pages(current, ctx->mm, >>> info->mmap_base, nr_pages, >>> 1, 0, info->ring_pages, NULL); >>> +#endif >> >> Can't you hide this in your 1/1 patch, by providing this function as >> just a static inline wrapper around get_user_pages when >> CONFIG_MEMORY_HOTREMOVE is not enabled? > > Yes, please. Having callers duplicate the call site for a single > optional boolean input is unacceptable. I will deal with it in next version :) > > But do we want another input argument as a name? Should aio have been > using get_user_pages_fast()? (and so now _fast_non_movable?) > > I wonder if it's time to offer the booleans as a _flags() variant, much > like the current internal flags for __get_user_pages(). The write and > force arguments are already booleans, we have a different fast api, and > now we're adding non-movable. The NON_MOVABLE flag would be 0 without > MEMORY_HOTREMOVE, easy peasy. As my next reply-mail mentioned, IIUC in GUP case additional flags seems doesn't work, I abstract here: As I debuged the get_user_pages(), I found that some pages is already there and may be allocated before we call get_user_pages(). __get_user_pages() have following logic to handle such case. 1786 while (!(page = follow_page(vma, start, foll_flags))) { 1787 int ret; To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP as smart as we want , so I worked out the migration approach to get around and avoid messing up the current code. And even worse we have already got *8* arguments...Maybe we have to rework the boolean arguments into bit flags... It seems not a little work :( > > Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might > also be nice :). Agree, maybe we could handle them later :) thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 5 Feb 2013 16:45:19 +0900 From: Minchan Kim Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Message-ID: <20130205074519.GB11197@blaptop> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> <51108DC8.4090704@cn.fujitsu.com> <20130205052517.GH2610@blaptop> <5110A442.5000707@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5110A442.5000707@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Lin Feng Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Feb 05, 2013 at 02:18:42PM +0800, Lin Feng wrote: > > > On 02/05/2013 01:25 PM, Minchan Kim wrote: > > Hi Lin, > > > > On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote: > >> Hi Minchan, > >> > >> On 02/05/2013 08:58 AM, Minchan Kim wrote: > >>> Hello, > >>> > >>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > >>>> Currently get_user_pages() always tries to allocate pages from movable zone, > >>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > >>>> of get_user_pages() is easy to pin user pages for a long time(for now we found > >>>> that pages pinned as aio ring pages is such case), which is fatal for memory > >>>> hotplug/remove framework. > >>>> > >>>> So the 1st patch introduces a new library function called > >>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > >>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from > >>>> non-movable zone via additional page migration. > >>>> > >>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by > >>>> get_user_pages() via using the new function. It only works when configed with > >>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > >>> > >>> CMA has same issue but the problem is the driver developers or any subsystem > >>> using GUP can't know their pages is in CMA area or not in advance. > >>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? > >>> Even some driver module in embedded side doesn't open their source code. > >> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users > >> of GUP, they will release the pinned pages immediately and to such users they should get > >> a good performance, using the old style interface is a smart way. And we had better just > >> deal with the cases we have to by using the new interface. > > > > Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages > > immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power > > by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know > > in advance they will use pinned pages long time or release in short time because it depends > > on some event like user's response which is very not predetermined. > > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of > > failing migration by page count and then, investigate they are really using GUP and it's > > REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? > > > > Yes. it could be done but it would be rather trobulesome job. > Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is > a big job. > If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the > *non_movable* version of GUP. > But in some server environment we want to keep the performance but also want to use hotremove > feature in case. Maybe patch the place as we need is a trade off for such support. > > Mel also said in the last discussion: > > On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote: > >> Well, that's a fairly low-level implementation detail. A more typical > >> approach would be to add a new get_user_pages_non_movable() or such. > >> That would probably have the same signature as get_user_pages(), with > >> one additional argument. Then get_user_pages() becomes a one-line > >> wrapper which passes in a particular value of that argument. > >> > > > > That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE > > allocations. That will impact THP availability by increasing the number > > of MIGRATE_UNMOVABLE blocks that exist and it would hit every user -- > > not just those that care about ZONE_MOVABLE. > > > > I'm likely to NAK such a patch if it's only about node hot-remove because > > it's much more of a corner case than wanting to use THP. > > > > I would prefer if get_user_pages() checked if the page it was about to > > pin was in ZONE_MOVABLE and if so, migrate it at that point before it's > > pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability > > if that's what they want. The CMA people might also want to take > > advantage of this if the page happened to be in the MIGRATE_CMA > > pageblock. > > > > So it may not a good idea that we all fall into calling the *non_movable* version of > GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think? Frankly speaking, I can't understand Mel's comment. AFAIUC, he said GUP checks the page before get_page and if the page is movable zone, then migrate it out of movable zone and get_page again. That's exactly what I want. It doesn't introduce GUP_NM. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <5110C28D.1040001@cn.fujitsu.com> Date: Tue, 05 Feb 2013 16:27:57 +0800 From: Lin Feng MIME-Version: 1.0 Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> <51108DC8.4090704@cn.fujitsu.com> <20130205052517.GH2610@blaptop> <5110A442.5000707@cn.fujitsu.com> <20130205074519.GB11197@blaptop> In-Reply-To: <20130205074519.GB11197@blaptop> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Hi Minchan, On 02/05/2013 03:45 PM, Minchan Kim wrote: >> So it may not a good idea that we all fall into calling the *non_movable* version of >> > GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think? > Frankly speaking, I can't understand Mel's comment. > AFAIUC, he said GUP checks the page before get_page and if the page is movable zone, > then migrate it out of movable zone and get_page again. > That's exactly what I want. It doesn't introduce GUP_NM. Since an long time pin or not is an unpredictable behave except you know what the caller wants to do. We have to check every time we call GUP, and GUP may need another parameter to teach itself to make the right decision? We have already got *8* parameters :( thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 5 Feb 2013 11:57:22 +0000 From: Mel Gorman Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130205115722.GF21389@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20130204160624.5c20a8a0.akpm@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote: > On Mon, 4 Feb 2013 18:04:07 +0800 > Lin Feng wrote: > > > get_user_pages() always tries to allocate pages from movable zone, which is not > > reliable to memory hotremove framework in some case. > > > > This patch introduces a new library function called get_user_pages_non_movable() > > to pin pages only from zone non-movable in memory. > > It's a wrapper of get_user_pages() but it makes sure that all pages come from > > non-movable zone via additional page migration. > > > > ... > > > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > > struct page **pages, struct vm_area_struct **vmas); > > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > > struct page **pages); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > > + unsigned long start, int nr_pages, int write, int force, > > + struct page **pages, struct vm_area_struct **vmas); > > +#endif > > The ifdefs aren't really needed here and I encourage people to omit > them. This keeps the header files looking neater and reduces the > chances of things later breaking because we forgot to update some > CONFIG_foo logic in a header file. The downside is that errors will be > revealed at link time rather than at compile time, but that's a pretty > small cost. > As an aside, if ifdefs *have* to be used then it often better to have a pattern like #ifdef CONFIG_MEMORY_HOTREMOVE int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int nr_pages, int write, int force, struct page **pages, struct vm_area_struct **vmas); #else static inline get_user_pages_non_movable(...) { get_user_pages(...) } #endif It eliminates the need for #ifdefs in the C file that calls get_user_pages_non_movable(). > > struct kvec; > > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > > struct page **pages); > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 73b64a3..5db811e 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > > return (idx == ZONE_NORMAL); > > } > > > > +static inline int is_movable(struct zone *zone) > > +{ > > + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > > +} > > A better name would be zone_is_movable(). He is matching the names of the is_normal(), is_dma32() and is_dma() helpers. Unfortunately I would expect a name like is_movable() to return true if the page had either been allocated with __GFP_MOVABLE or if it was checking migrate can currently handle the page. is_movable() is indeed a terrible name. They should be renamed or deleted in preparation -- see below. > We haven't been very > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > > And a neater implementation would be > > return zone_idx(zone) == ZONE_MOVABLE; > Yes. > All of which made me look at mmzone.h, and what I saw wasn't very nice :( > > Please review... > > > From: Andrew Morton > Subject: include/linux/mmzone.h: cleanups > > - implement zone_idx() in C to fix its references-args-twice macro bug > > - use zone_idx() in is_highmem() to remove large amounts of silly fluff. > > Cc: Lin Feng > Cc: Mel Gorman > Signed-off-by: Andrew Morton > --- > > include/linux/mmzone.h | 13 ++++++++----- > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h > --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups > +++ a/include/linux/mmzone.h > @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by > /* > * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. > */ > -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > +static inline enum zone_type zone_idx(struct zone *zone) > +{ > + return zone - zone->zone_pgdat->node_zones; > +} > > static inline int populated_zone(struct zone *zone) > { > @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon > static inline int is_highmem(struct zone *zone) > { > #ifdef CONFIG_HIGHMEM > - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || > - (zone_off == ZONE_MOVABLE * sizeof(*zone) && > - zone_movable_is_highmem()); > + enum zone_type idx = zone_idx(zone); > + > + return idx == ZONE_HIGHMEM || > + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); > #else > return 0; > #endif *blinks*. Ok, while I think your version looks nicer, it is reverting ddc81ed2 (remove sparse warning for mmzone.h). According to my version of gcc at least, your patch reintroduces the sar and sparse complains if run as make ARCH=i386 C=2 CF=-Wsparse-all .... this? From: Mel Gorman Subject: [PATCH] include/linux/mmzone.h: cleanups - implement zone_idx() in C to fix its references-args-twice macro bug - implement zone_is_idx in an attempt to explain what the fluff in is_highmem is for - rename is_highmem to zone_is_highmem as preparation for zone_is_movable() - implement zone_is_movable as a helper for places that care about ZONE_MOVABLE - Remove is_normal, is_dma32 and is_dma because apparently no one cares - convert users of zone_idx(zone) == ZONE_FOO to appropriate zone_is_foo() helper - Use is_highmem_idx() instead of is_highmem for the PageHighMem implementation. Should be functionally equivalent because we only care about the zones index, not exactly what zone it is [akpm@linux-foundation.org: zone_idx() reimplementation] Signed-off-by: Andrew Morton Signed-off-by: Mel Gorman --- arch/s390/mm/init.c | 2 +- arch/tile/mm/init.c | 5 +---- arch/x86/mm/highmem_32.c | 2 +- include/linux/mmzone.h | 49 +++++++++++++++----------------------------- include/linux/page-flags.h | 2 +- kernel/power/snapshot.c | 12 +++++------ mm/page_alloc.c | 6 +++--- 7 files changed, 30 insertions(+), 48 deletions(-) diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c index 81e596c..c26c321 100644 --- a/arch/s390/mm/init.c +++ b/arch/s390/mm/init.c @@ -231,7 +231,7 @@ int arch_add_memory(int nid, u64 start, u64 size) if (rc) return rc; for_each_zone(zone) { - if (zone_idx(zone) != ZONE_MOVABLE) { + if (!zone_is_movable(zone)) { /* Add range within existing zone limits */ zone_start_pfn = zone->zone_start_pfn; zone_end_pfn = zone->zone_start_pfn + diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c index ef29d6c..8287749 100644 --- a/arch/tile/mm/init.c +++ b/arch/tile/mm/init.c @@ -733,9 +733,6 @@ static void __init set_non_bootmem_pages_init(void) for_each_zone(z) { unsigned long start, end; int nid = z->zone_pgdat->node_id; -#ifdef CONFIG_HIGHMEM - int idx = zone_idx(z); -#endif start = z->zone_start_pfn; end = start + z->spanned_pages; @@ -743,7 +740,7 @@ static void __init set_non_bootmem_pages_init(void) start = max(start, max_low_pfn); #ifdef CONFIG_HIGHMEM - if (idx == ZONE_HIGHMEM) + if (zone_is_highmem(z)) totalhigh_pages += z->spanned_pages; #endif if (kdata_huge) { diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c index 6f31ee5..63020e7 100644 --- a/arch/x86/mm/highmem_32.c +++ b/arch/x86/mm/highmem_32.c @@ -124,7 +124,7 @@ void __init set_highmem_pages_init(void) for_each_zone(zone) { unsigned long zone_start_pfn, zone_end_pfn; - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) continue; zone_start_pfn = zone->zone_start_pfn; diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a23923b..5ecef75 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -779,10 +779,20 @@ static inline int local_memory_node(int node_id) { return node_id; }; unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long); #endif +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) +{ + /* This mess avoids a potentially expensive pointer subtraction. */ + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; + return zone_off == idx * sizeof(*zone); +} + /* * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. */ -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) +static inline enum zone_type zone_idx(struct zone *zone) +{ + return zone - zone->zone_pgdat->node_zones; +} static inline int populated_zone(struct zone *zone) { @@ -810,50 +820,25 @@ static inline int is_highmem_idx(enum zone_type idx) #endif } -static inline int is_normal_idx(enum zone_type idx) -{ - return (idx == ZONE_NORMAL); -} - /** - * is_highmem - helper function to quickly check if a struct zone is a + * zone_is_highmem - helper function to quickly check if a struct zone is a * highmem zone or not. This is an attempt to keep references * to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum. * @zone - pointer to struct zone variable */ -static inline int is_highmem(struct zone *zone) +static inline bool zone_is_highmem(struct zone *zone) { #ifdef CONFIG_HIGHMEM - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || - (zone_off == ZONE_MOVABLE * sizeof(*zone) && - zone_movable_is_highmem()); -#else - return 0; -#endif -} - -static inline int is_normal(struct zone *zone) -{ - return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL; -} - -static inline int is_dma32(struct zone *zone) -{ -#ifdef CONFIG_ZONE_DMA32 - return zone == zone->zone_pgdat->node_zones + ZONE_DMA32; + return zone_is_idx(zone, ZONE_HIGHMEM) || + (zone_is_idx(zone, ZONE_MOVABLE) && zone_movable_is_highmem()); #else return 0; #endif } -static inline int is_dma(struct zone *zone) +static inline bool zone_is_movable(struct zone *zone) { -#ifdef CONFIG_ZONE_DMA - return zone == zone->zone_pgdat->node_zones + ZONE_DMA; -#else - return 0; -#endif + return zone_is_idx(zone, ZONE_MOVABLE); } /* These two functions are used to setup the per zone pages min values */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index b5d1384..2f541f1 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -237,7 +237,7 @@ PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */ * Must use a macro here due to header dependency issues. page_zone() is not * available at this point. */ -#define PageHighMem(__p) is_highmem(page_zone(__p)) +#define PageHighMem(__p) is_highmem_idx(page_zonenum(__p)) #else PAGEFLAG_FALSE(HighMem) #endif diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 0de2857..cbf0da9 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -830,7 +830,7 @@ static unsigned int count_free_highmem_pages(void) unsigned int cnt = 0; for_each_populated_zone(zone) - if (is_highmem(zone)) + if (zone_is_highmem(zone)) cnt += zone_page_state(zone, NR_FREE_PAGES); return cnt; @@ -879,7 +879,7 @@ static unsigned int count_highmem_pages(void) for_each_populated_zone(zone) { unsigned long pfn, max_zone_pfn; - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) continue; mark_free_pages(zone); @@ -943,7 +943,7 @@ static unsigned int count_data_pages(void) unsigned int n = 0; for_each_populated_zone(zone) { - if (is_highmem(zone)) + if (zone_is_highmem(zone)) continue; mark_free_pages(zone); @@ -989,7 +989,7 @@ static void safe_copy_page(void *dst, struct page *s_page) static inline struct page * page_is_saveable(struct zone *zone, unsigned long pfn) { - return is_highmem(zone) ? + return zone_is_highmem(zone) ? saveable_highmem_page(zone, pfn) : saveable_page(zone, pfn); } @@ -1338,7 +1338,7 @@ int hibernate_preallocate_memory(void) size = 0; for_each_populated_zone(zone) { size += snapshot_additional_pages(zone); - if (is_highmem(zone)) + if (zone_is_highmem(zone)) highmem += zone_page_state(zone, NR_FREE_PAGES); else count += zone_page_state(zone, NR_FREE_PAGES); @@ -1481,7 +1481,7 @@ static int enough_free_mem(unsigned int nr_pages, unsigned int nr_highmem) unsigned int free = alloc_normal; for_each_populated_zone(zone) - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) free += zone_page_state(zone, NR_FREE_PAGES); nr_pages += count_pages_for_highmem(nr_highmem); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7e208f0..9558341 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5136,7 +5136,7 @@ static void __setup_per_zone_wmarks(void) /* Calculate total number of !ZONE_HIGHMEM pages */ for_each_zone(zone) { - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) lowmem_pages += zone->present_pages; } @@ -5146,7 +5146,7 @@ static void __setup_per_zone_wmarks(void) spin_lock_irqsave(&zone->lock, flags); tmp = (u64)pages_min * zone->present_pages; do_div(tmp, lowmem_pages); - if (is_highmem(zone)) { + if (zone_is_highmem(zone)) { /* * __GFP_HIGH and PF_MEMALLOC allocations usually don't * need highmem pages, so cap pages_min to a small @@ -5585,7 +5585,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count) * For avoiding noise data, lru_add_drain_all() should be called * If ZONE_MOVABLE, the zone never contains unmovable pages */ - if (zone_idx(zone) == ZONE_MOVABLE) + if (zone_is_movable(zone)) return false; mt = get_pageblock_migratetype(page); if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt)) > _ > > > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -58,6 +58,8 @@ > > #include > > #include > > #include > > +#include > > +#include > > #include > > > > #include > > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > > } > > EXPORT_SYMBOL(get_user_pages); > > > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > +/** > > + * It's a wrapper of get_user_pages() but it makes sure that all pages come from > > + * non-movable zone via additional page migration. > > + */ > > This needs a description of why the function exists - say something > about the requirements of memory hotplug. > > Also a few words describing how the function works would be good. > > > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > > + unsigned long start, int nr_pages, int write, int force, > > + struct page **pages, struct vm_area_struct **vmas) > > +{ > > + int ret, i, isolate_err, migrate_pre_flag; migrate_pre_flag should be bool and the name is a bit unusual. migrate_prepped? > > + LIST_HEAD(pagelist); > > + > > +retry: > > + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, > > + vmas); > > We should handle (ret < 0) here. At present the function will silently > convert an error return into "return 0", which is bad. The function > does appear to otherwise do the right thing if get_user_pages() failed, > but only due to good luck. > A BUG_ON check if we retry more than once wouldn't hurt either. It requires a broken implementation of alloc_migrate_target() but someone might "fix" it and miss this. A minor aside, it would be nice if we exited at this point if there was no populated ZONE_MOVABLE in the system. There is a movable_zone global variable already. If it was forced to be MAX_NR_ZONES before the call to find_usable_zone_for_movable() in memory initialisation we should be able to make a cheap check for it here. > > + isolate_err = 0; > > + migrate_pre_flag = 0; > > + > > + for (i = 0; i < ret; i++) { > > + if (is_movable(page_zone(pages[i]))) { This is a relatively expensive lookup. If my patch above is used then you can add a is_movable_idx helper and then this becomes if (is_movable_idx(page_zonenum(pages[i]))) which is cheaper. > > + if (!migrate_pre_flag) { > > + if (migrate_prep()) > > + goto put_page; CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never return failure. You could make this BUG_ON(migrate_prep()), remove this goto and the migrate_pre_flag check below becomes redundant. > > + migrate_pre_flag = 1; > > + } > > + > > + if (!isolate_lru_page(pages[i])) { > > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > > + page_is_file_cache(pages[i])); > > + list_add_tail(&pages[i]->lru, &pagelist); > > + } else { > > + isolate_err = 1; > > + goto put_page; > > + } isolate_lru_page() takes the LRU lock every time. If get_user_pages_non_movable is used heavily then you may encounter lock contention problems. Batching this lock would be a separate patch and should not be necessary yet but you should at least comment on it as a reminder. I think that a goto could also have been avoided here if you used break. The i == ret check below would be false and it would just fall through. Overall the flow would be a bit easier to follow. Why list_add_tail()? I don't think it's wrong but it's unusual to see list_add_tail() when list_add() is enough. > > + } > > + } > > + > > + /* All pages are non movable, we are done :) */ > > + if (i == ret && list_empty(&pagelist)) > > + return ret; > > + > > +put_page: > > + /* Undo the effects of former get_user_pages(), we won't pin anything */ > > + for (i = 0; i < ret; i++) > > + put_page(pages[i]); > > + release_pages. That comment is insufficient. There are non-obvious consequences to this logic. We are dropping pins on all pages regardless of what zone they are in. If the subsequent migration fails then we end up returning 0 with no pages pinned. The user-visible effect is that io_setup() fails for non-obvious reasons. It will return EAGAIN to userspace which will be interpreted as "The specified nr_events exceeds the user's limit of available events.". The application will either fail or potentially infinite loop if the developer interpreted EAGAIN as "try again" as opposed to "this is a permanent failure". Is that deliberate? Is it really preferable that AIO can fail to setup and the application exit just in case we want to hot-remove memory later? Should a failed migration generate a WARN_ON at least? I would think that it's better to WARN_ON_ONCE if migration fails but pin the pages as requested. If a future memory hot-remove operation fails then the warning will indicate why but applications will not fail as a result. It's a little clumsy but the memory hot-remove failure message could list what applications have pinned the pages that cannot be removed so the administrator has the option of force-killing the application. It is possible to discover what application is pinning a page from userspace but it would involve an expensive search with /proc/kpagemap > > + if (migrate_pre_flag && !isolate_err) { > > + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > > + false, MIGRATE_SYNC, MR_SYSCALL); The conversion of alloc_migrate_target is a bit problematic. It strips the __GFP_MOVABLE flag and the consequence of this is that it converts those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large number of pages, particularly if the number of get_user_pages_non_movable() increases for short-lived pins like direct IO. One way around this is to add a high_zoneidx parameter to __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask) as the high_zoneidx and create a new migration allocation function that passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE > > + /* Steal pages from non-movable zone successfully? */ > > + if (!ret) > > + goto retry; > > This is buggy. migrate_pages() doesn't empty its `from' argument, so > page_list must be reinitialised here (or, better, at the start of the loop). > page_list should be empty if ret == 0. > Mel, what's up with migrate_pages()? Shouldn't it be removing the > pages from the list when MIGRATEPAGE_SUCCESS? The use of > list_for_each_entry_safe() suggests we used to do that... > On successful migration, the pages are put back on the LRU at the end of unmap_and_move(). On migration failure the caller is responsible for putting the pages back on the LRU with putback_lru_pages() which happens below. > > + } > > + > > + putback_lru_pages(&pagelist); > > + return 0; > > +} > > +EXPORT_SYMBOL(get_user_pages_non_movable); > > +#endif > > + > -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 5 Feb 2013 13:32:44 +0000 From: Mel Gorman Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130205133244.GH21389@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20130205115722.GF21389@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote: > > > > + migrate_pre_flag = 1; > > > + } > > > + > > > + if (!isolate_lru_page(pages[i])) { > > > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > > > + page_is_file_cache(pages[i])); > > > + list_add_tail(&pages[i]->lru, &pagelist); > > > + } else { > > > + isolate_err = 1; > > > + goto put_page; > > > + } > > isolate_lru_page() takes the LRU lock every time. Credit to Michal Hocko for bringing this up but with the number of other issues I missed that this is also broken with respect to huge page handling. hugetlbfs pages will not be on the LRU so the isolation will mess up and the migration has to be handled differently. Ordinarily hugetlbfs pages cannot be allocated from ZONE_MOVABLE but it is possible to configure it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this encounters a hugetlbfs page, it'll just blow up. The other is that this almost certainly broken for transhuge page handling. gup returns the head and tail pages and ordinarily this is ok because the caller only cares about the physical address. Migration will also split a hugepage if it receives it but you are potentially adding tail pages to a list here and then migrating them. The split of the first page will get very confused. I'm not exactly sure what the result will be but it won't be pretty. Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled during testing? -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 5 Feb 2013 13:13:04 -0800 From: Andrew Morton Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-Id: <20130205131304.82a5c4cb.akpm@linux-foundation.org> In-Reply-To: <511077E7.4040605@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <511077E7.4040605@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Lin Feng Cc: mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Tang chen , Gu Zheng On Tue, 05 Feb 2013 11:09:27 +0800 Lin Feng wrote: > > > >> struct kvec; > >> int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > >> struct page **pages); > >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > >> index 73b64a3..5db811e 100644 > >> --- a/include/linux/mmzone.h > >> +++ b/include/linux/mmzone.h > >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > >> return (idx == ZONE_NORMAL); > >> } > >> > >> +static inline int is_movable(struct zone *zone) > >> +{ > >> + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > >> +} > > > > A better name would be zone_is_movable(). We haven't been very > > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > > > Yes, zone_is_xxx() should be a better name, but there are some analogous > definition like is_dma32() and is_normal() etc, if we only use zone_is_movable() > for movable zone it will break such naming rules. > Should we update other ones in a separate patch later or just keep the old style? I do think the old names were poorly chosen. Yes, we could fix them up sometime but it's hardly a pressing issue. > > And a neater implementation would be > > > > return zone_idx(zone) == ZONE_MOVABLE; > > > OK. After your change, should we also update for other ones such as is_normal()..? Sure. From: Andrew Morton Subject: include-linux-mmzoneh-cleanups-fix use zone_idx() some more, further simplify is_highmem() Cc: Lin Feng Cc: Mel Gorman Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix include/linux/mmzone.h --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix +++ a/include/linux/mmzone.h @@ -859,25 +859,18 @@ static inline int is_normal_idx(enum zon */ static inline int is_highmem(struct zone *zone) { -#ifdef CONFIG_HIGHMEM - enum zone_type idx = zone_idx(zone); - - return idx == ZONE_HIGHMEM || - (idx == ZONE_MOVABLE && zone_movable_is_highmem()); -#else - return 0; -#endif + return is_highmem_idx(zone_idx(zone)); } static inline int is_normal(struct zone *zone) { - return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL; + return zone_idx(zone) == ZONE_NORMAL; } static inline int is_dma32(struct zone *zone) { #ifdef CONFIG_ZONE_DMA32 - return zone == zone->zone_pgdat->node_zones + ZONE_DMA32; + return zone_idx(zone) == ZONE_DMA32; #else return 0; #endif @@ -886,7 +879,7 @@ static inline int is_dma32(struct zone * static inline int is_dma(struct zone *zone) { #ifdef CONFIG_ZONE_DMA - return zone == zone->zone_pgdat->node_zones + ZONE_DMA; + return zone_idx(zone) == ZONE_DMA; #else return 0; #endif _ > >> + /* All pages are non movable, we are done :) */ > >> + if (i == ret && list_empty(&pagelist)) > >> + return ret; > >> + > >> +put_page: > >> + /* Undo the effects of former get_user_pages(), we won't pin anything */ > >> + for (i = 0; i < ret; i++) > >> + put_page(pages[i]); > >> + > >> + if (migrate_pre_flag && !isolate_err) { > >> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >> + false, MIGRATE_SYNC, MR_SYSCALL); > >> + /* Steal pages from non-movable zone successfully? */ > >> + if (!ret) > >> + goto retry; > > > > This is buggy. migrate_pages() doesn't empty its `from' argument, so > > page_list must be reinitialised here (or, better, at the start of the loop). > migrate_pages() > list_for_each_entry_safe() > unmap_and_move() > if (rc != -EAGAIN) { > list_del(&page->lru); > } ah, OK, there it is. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx137.postini.com [74.125.245.137]) by kanga.kvack.org (Postfix) with SMTP id 895746B005A for ; Tue, 5 Feb 2013 21:26:52 -0500 (EST) Received: by mail-vc0-f180.google.com with SMTP id fo13so527240vcb.25 for ; Tue, 05 Feb 2013 18:26:51 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20130205115722.GF21389@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> Date: Tue, 5 Feb 2013 18:26:51 -0800 Message-ID: Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() From: Michel Lespinasse Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Just nitpicking, but: On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman wrote: > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) > +{ > + /* This mess avoids a potentially expensive pointer subtraction. */ > + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > + return zone_off == idx * sizeof(*zone); > +} Maybe: return zone == zone->zone_pgdat->node_zones + idx; ? -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <512203C4.8010608@cn.fujitsu.com> Date: Mon, 18 Feb 2013 18:34:44 +0800 From: Lin Feng MIME-Version: 1.0 Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> In-Reply-To: <20130205115722.GF21389@suse.de> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15 Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Hi Mel, See below. On 02/05/2013 07:57 PM, Mel Gorman wrote: > On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote: >> The ifdefs aren't really needed here and I encourage people to omit >> them. This keeps the header files looking neater and reduces the >> chances of things later breaking because we forgot to update some >> CONFIG_foo logic in a header file. The downside is that errors will be >> revealed at link time rather than at compile time, but that's a pretty >> small cost. >> > > As an aside, if ifdefs *have* to be used then it often better to have a > pattern like > > #ifdef CONFIG_MEMORY_HOTREMOVE > int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > unsigned long start, int nr_pages, int write, int force, > struct page **pages, struct vm_area_struct **vmas); > #else > static inline get_user_pages_non_movable(...) > { > get_user_pages(...) > } > #endif > > It eliminates the need for #ifdefs in the C file that calls > get_user_pages_non_movable(). Thanks for pointing out :) >>> + >>> +retry: >>> + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, >>> + vmas); >> >> We should handle (ret < 0) here. At present the function will silently >> convert an error return into "return 0", which is bad. The function >> does appear to otherwise do the right thing if get_user_pages() failed, >> but only due to good luck. >> > > A BUG_ON check if we retry more than once wouldn't hurt either. It requires > a broken implementation of alloc_migrate_target() but someone might "fix" > it and miss this. Agree. > > A minor aside, it would be nice if we exited at this point if there was > no populated ZONE_MOVABLE in the system. There is a movable_zone global > variable already. If it was forced to be MAX_NR_ZONES before the call to > find_usable_zone_for_movable() in memory initialisation we should be able > to make a cheap check for it here. OK. >>> + if (!migrate_pre_flag) { >>> + if (migrate_prep()) >>> + goto put_page; > > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never > return failure. You could make this BUG_ON(migrate_prep()), remove this > goto and the migrate_pre_flag check below becomes redundant. Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check will call/check migrate_prep() every time we isolate a single page, do we have to do so? I mean that for a group of pages it's sufficient to call migrate_prep() only once. There are some other migrate-relative codes using this scheme. So I get confused, do I miss something or the the performance cost is little? :( > >>> + migrate_pre_flag = 1; >>> + } >>> + >>> + if (!isolate_lru_page(pages[i])) { >>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >>> + page_is_file_cache(pages[i])); >>> + list_add_tail(&pages[i]->lru, &pagelist); >>> + } else { >>> + isolate_err = 1; >>> + goto put_page; >>> + } > > isolate_lru_page() takes the LRU lock every time. If > get_user_pages_non_movable is used heavily then you may encounter lock > contention problems. Batching this lock would be a separate patch and should > not be necessary yet but you should at least comment on it as a reminder. Farsighted, should improve it in the future. > > I think that a goto could also have been avoided here if you used break. The > i == ret check below would be false and it would just fall through. > Overall the flow would be a bit easier to follow. Yes, I noticed this before. I thought using goto could save some micro ops (here is only the i == ret check) though lower the readability but still be clearly. Also there are some other place in current kernel facing such performance/readability race condition, but it seems that the developers prefer readability, why? While I think performance is fatal to kernel.. > > Why list_add_tail()? I don't think it's wrong but it's unusual to see > list_add_tail() when list_add() is enough. Sorry, not intentional, just steal from other codes without thinking more ;-) > >>> + } >>> + } >>> + >>> + /* All pages are non movable, we are done :) */ >>> + if (i == ret && list_empty(&pagelist)) >>> + return ret; >>> + >>> +put_page: >>> + /* Undo the effects of former get_user_pages(), we won't pin anything */ >>> + for (i = 0; i < ret; i++) >>> + put_page(pages[i]); >>> + > > release_pages. > > That comment is insufficient. There are non-obvious consequences to this > logic. We are dropping pins on all pages regardless of what zone they > are in. If the subsequent migration fails then we end up returning 0 > with no pages pinned. The user-visible effect is that io_setup() fails > for non-obvious reasons. It will return EAGAIN to userspace which will be > interpreted as "The specified nr_events exceeds the user's limit of available > events.". The application will either fail or potentially infinite loop > if the developer interpreted EAGAIN as "try again" as opposed to "this is > a permanent failure". Yes, I missed such things. > > Is that deliberate? Is it really preferable that AIO can fail to setup > and the application exit just in case we want to hot-remove memory later? > Should a failed migration generate a WARN_ON at least? > > I would think that it's better to WARN_ON_ONCE if migration fails but pin > the pages as requested. If a future memory hot-remove operation fails > then the warning will indicate why but applications will not fail as a I'm so agree, I will keep the pin state alive and issue a WARN_ON_ONCE message in such case. > result. It's a little clumsy but the memory hot-remove failure message > could list what applications have pinned the pages that cannot be removed > so the administrator has the option of force-killing the application. It > is possible to discover what application is pinning a page from userspace > but it would involve an expensive search with /proc/kpagemap > >>> + if (migrate_pre_flag && !isolate_err) { >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >>> + false, MIGRATE_SYNC, MR_SYSCALL); > > The conversion of alloc_migrate_target is a bit problematic. It strips > the __GFP_MOVABLE flag and the consequence of this is that it converts > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > number of pages, particularly if the number of get_user_pages_non_movable() > increases for short-lived pins like direct IO. Sorry, I don't quite understand here neither. If we use the following new migration allocation function as you said, the increasing number of get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE pages. What's the difference, do I miss something? > > One way around this is to add a high_zoneidx parameter to > __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new > inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask) > as the high_zoneidx and create a new migration allocation function that > passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to > happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE On 02/05/2013 09:32 PM, Mel Gorman wrote: > Credit to Michal Hocko for bringing this up but with the number of > other issues I missed that this is also broken with respect to huge page > handling. hugetlbfs pages will not be on the LRU so the isolation will mess > up and the migration has to be handled differently. Ordinarily hugetlbfs > pages cannot be allocated from ZONE_MOVABLE but it is possible to configure > it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this > encounters a hugetlbfs page, it'll just blow up. > > The other is that this almost certainly broken for transhuge page > handling. gup returns the head and tail pages and ordinarily this is ok > because the caller only cares about the physical address. Migration will > also split a hugepage if it receives it but you are potentially adding > tail pages to a list here and then migrating them. The split of the first > page will get very confused. I'm not exactly sure what the result will be > but it won't be pretty. > > Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled > during testing? Sorry, I hadn't considered THP and hugepage yet, I will deal such cases based on your comments later. thanks a lot for your review :) linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 18 Feb 2013 15:17:16 +0000 From: Mel Gorman Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130218151716.GL4365@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <512203C4.8010608@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <512203C4.8010608@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Lin Feng Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org On Mon, Feb 18, 2013 at 06:34:44PM +0800, Lin Feng wrote: > >>> + if (!migrate_pre_flag) { > >>> + if (migrate_prep()) > >>> + goto put_page; > > > > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never > > return failure. You could make this BUG_ON(migrate_prep()), remove this > > goto and the migrate_pre_flag check below becomes redundant. > Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check > will call/check migrate_prep() every time we isolate a single page, do we have > to do so? I was referring to the migrate_pre_flag check further down and I'm not suggesting it be removed from here as you do want to call migrate_prep() only once. However, as it'll never return failure for any kernel configuration that allows memory hot-remove, this goto can be removed and the flow simplified. > > > >>> + migrate_pre_flag = 1; > >>> + } > >>> + > >>> + if (!isolate_lru_page(pages[i])) { > >>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > >>> + page_is_file_cache(pages[i])); > >>> + list_add_tail(&pages[i]->lru, &pagelist); > >>> + } else { > >>> + isolate_err = 1; > >>> + goto put_page; > >>> + } > > > > isolate_lru_page() takes the LRU lock every time. If > > get_user_pages_non_movable is used heavily then you may encounter lock > > contention problems. Batching this lock would be a separate patch and should > > not be necessary yet but you should at least comment on it as a reminder. > Farsighted, should improve it in the future. > > > > > I think that a goto could also have been avoided here if you used break. The > > i == ret check below would be false and it would just fall through. > > Overall the flow would be a bit easier to follow. > Yes, I noticed this before. I thought using goto could save some micro ops > (here is only the i == ret check) though lower the readability but still be clearly. > > Also there are some other place in current kernel facing such performance/readability > race condition, but it seems that the developers prefer readability, why? While I > think performance is fatal to kernel.. > Memory hot-remove and page migration are not performance critical paths. For page migration, the cost will be likely dominated by the page copy but it's also possible the cost will be dominated by locking. The difference between a break and goto will not even be measurable. > > > > > > result. It's a little clumsy but the memory hot-remove failure message > > could list what applications have pinned the pages that cannot be removed > > so the administrator has the option of force-killing the application. It > > is possible to discover what application is pinning a page from userspace > > but it would involve an expensive search with /proc/kpagemap > > > >>> + if (migrate_pre_flag && !isolate_err) { > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >>> + false, MIGRATE_SYNC, MR_SYSCALL); > > > > The conversion of alloc_migrate_target is a bit problematic. It strips > > the __GFP_MOVABLE flag and the consequence of this is that it converts > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > > number of pages, particularly if the number of get_user_pages_non_movable() > > increases for short-lived pins like direct IO. > > Sorry, I don't quite understand here neither. If we use the following new > migration allocation function as you said, the increasing number of > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE > pages. What's the difference, do I miss something? > The replacement function preserves the __GFP_MOVABLE flag. It cannot use ZONE_MOVABLE but otherwise the newly allocated page will be grouped with other movable pages. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 19 Feb 2013 10:34:59 +0000 From: Mel Gorman Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130219103458.GN4365@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <512203C4.8010608@cn.fujitsu.com> <20130218151716.GL4365@suse.de> <51234C12.4020404@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <51234C12.4020404@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Lin Feng Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org On Tue, Feb 19, 2013 at 05:55:30PM +0800, Lin Feng wrote: > Hi Mel, > > On 02/18/2013 11:17 PM, Mel Gorman wrote: > >>> > > > >>> > > > >>> > > result. It's a little clumsy but the memory hot-remove failure message > >>> > > could list what applications have pinned the pages that cannot be removed > >>> > > so the administrator has the option of force-killing the application. It > >>> > > is possible to discover what application is pinning a page from userspace > >>> > > but it would involve an expensive search with /proc/kpagemap > >>> > > > >>>>> > >>> + if (migrate_pre_flag && !isolate_err) { > >>>>> > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >>>>> > >>> + false, MIGRATE_SYNC, MR_SYSCALL); > >>> > > > >>> > > The conversion of alloc_migrate_target is a bit problematic. It strips > >>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts > >>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > >>> > > number of pages, particularly if the number of get_user_pages_non_movable() > >>> > > increases for short-lived pins like direct IO. > >> > > >> > Sorry, I don't quite understand here neither. If we use the following new > >> > migration allocation function as you said, the increasing number of > >> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE > >> > pages. What's the difference, do I miss something? > >> > > > The replacement function preserves the __GFP_MOVABLE flag. It cannot use > > ZONE_MOVABLE but otherwise the newly allocated page will be grouped with > > other movable pages. > > Ah, got it " But GFP_MOVABLE is not only a zone specifier but also an allocation policy.". > Update the comment and describe the exception then. > Could I clear __GFP_HIGHMEM flag in alloc_migrate_target depending on private parameter so > that we can keep MIGRATE_UNMOVABLE policy also allocate page none movable zones with little > change? > It should work (double check gfp_zone) but then the allocation cannot use the highmem zone. If you can be 100% certain that zone will not exist be populated then it will work as expected but it's a hack and should be commented clearly. You could do a BUILD_BUG_ON if CONFIG_HIGHMEM is set to enforce it. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx162.postini.com [74.125.245.162]) by kanga.kvack.org (Postfix) with SMTP id A59EC6B0008 for ; Tue, 19 Feb 2013 21:44:48 -0500 (EST) Received: from /spool/local by e23smtp02.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Wed, 20 Feb 2013 12:38:40 +1000 Date: Wed, 20 Feb 2013 10:44:35 +0800 From: Wanpeng Li Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130220024435.GA30208@hacker.(null)> Reply-To: Wanpeng Li References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51238033.6010005@cn.fujitsu.com> <5124363C.9060604@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5124363C.9060604@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Lin Feng Cc: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org On Wed, Feb 20, 2013 at 10:34:36AM +0800, Lin Feng wrote: > > >On 02/19/2013 09:37 PM, Lin Feng wrote: >>> > >>> > The other is that this almost certainly broken for transhuge page >>> > handling. gup returns the head and tail pages and ordinarily this is ok >> I can't find codes doing such things :(, could you please point me out? >> >Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it. >flee... According to the compound page, the first page of compound page is called head page, other sub pages are called tail pages. Regards, Wanpeng Li > >thanks, >linfeng > >-- >To unsubscribe, send a message with 'unsubscribe linux-mm' in >the body to majordomo@kvack.org. For more info on Linux MM, >see: http://www.linux-mm.org/ . >Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <5124363C.9060604@cn.fujitsu.com> Date: Wed, 20 Feb 2013 10:34:36 +0800 From: Lin Feng MIME-Version: 1.0 Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51238033.6010005@cn.fujitsu.com> In-Reply-To: <51238033.6010005@cn.fujitsu.com> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15 Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org On 02/19/2013 09:37 PM, Lin Feng wrote: >> > >> > The other is that this almost certainly broken for transhuge page >> > handling. gup returns the head and tail pages and ordinarily this is ok > I can't find codes doing such things :(, could you please point me out? > Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it. flee... thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <51243BF7.3030004@cn.fujitsu.com> Date: Wed, 20 Feb 2013 10:59:03 +0800 From: Lin Feng MIME-Version: 1.0 Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51238033.6010005@cn.fujitsu.com> <5124363C.9060604@cn.fujitsu.com> <20130220024435.GA30208@hacker.(null)> In-Reply-To: <20130220024435.GA30208@hacker.(null)> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Wanpeng Li Cc: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Hi Wanpeng, On 02/20/2013 10:44 AM, Wanpeng Li wrote: >> Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it. >> >flee... > According to the compound page, the first page of compound page is > called head page, other sub pages are called tail pages. > > Regards, > Wanpeng Li > Yes, you are right, thanks for explaining. I thought it just means only the last one of the compound pages.. thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <5124A42B.1020905@cn.fujitsu.com> Date: Wed, 20 Feb 2013 18:23:39 +0800 From: Lin Feng MIME-Version: 1.0 Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51249E3E.9070909@gmail.com> In-Reply-To: <51249E3E.9070909@gmail.com> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15 Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Hi Simon, On 02/20/2013 05:58 PM, Simon Jeons wrote: > >> >> The other is that this almost certainly broken for transhuge page >> handling. gup returns the head and tail pages and ordinarily this is ok > > When need gup thp? in kvm case? gup just pins the wanted pages(for x86 is 4k size) of user address space in memory. We can't expect the pages have been allocated for user address space are thp or normal page. So we need to deal with them and I think it have nothing to do with kvm. thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <5124B991.1020302@cn.fujitsu.com> Date: Wed, 20 Feb 2013 19:54:57 +0800 From: Lin Feng MIME-Version: 1.0 Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51249E3E.9070909@gmail.com> <5124A42B.1020905@cn.fujitsu.com> <5124B42A.1020908@gmail.com> In-Reply-To: <5124B42A.1020908@gmail.com> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15 Sender: owner-linux-mm@kvack.org List-ID: To: Simon Jeons Cc: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org On 02/20/2013 07:31 PM, Simon Jeons wrote: > On 02/20/2013 06:23 PM, Lin Feng wrote: >> Hi Simon, >> >> On 02/20/2013 05:58 PM, Simon Jeons wrote: >>>> The other is that this almost certainly broken for transhuge page >>>> handling. gup returns the head and tail pages and ordinarily this is ok >>> When need gup thp? in kvm case? >> gup just pins the wanted pages(for x86 is 4k size) of user address space in memory. >> We can't expect the pages have been allocated for user address space are thp or >> normal page. So we need to deal with them and I think it have nothing to do with kvm. > > Ok, I'm curious about userspace process call which funtion(will call gup) to pin pages except make_pages_present? No, userspace process doesn't pin any pages directly but through some syscalls like io_setup() indirectly for other purpose because kernel can't pagefault and it have to keep the page alive. Kernel wants to communicate with the userspace such as to notify some events so it need some sort of buffer that both Kernel and User space can both access, which leads to so called pin pages by gup. thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753390Ab3BDKFf (ORCPT ); Mon, 4 Feb 2013 05:05:35 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:3594 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1751023Ab3BDKFd (ORCPT ); Mon, 4 Feb 2013 05:05:33 -0500 X-IronPort-AV: E=Sophos;i="4.84,599,1355068800"; d="scan'208";a="6687004" From: Lin Feng To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng Subject: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Date: Mon, 4 Feb 2013 18:04:06 +0800 Message-Id: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> X-Mailer: git-send-email 1.7.11.7 X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/04 18:04:14, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/04 18:04:15, Serialize complete at 2013/02/04 18:04:15 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently get_user_pages() always tries to allocate pages from movable zone, as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users of get_user_pages() is easy to pin user pages for a long time(for now we found that pages pinned as aio ring pages is such case), which is fatal for memory hotplug/remove framework. So the 1st patch introduces a new library function called get_user_pages_non_movable() to pin pages only from zone non-movable in memory. It's a wrapper of get_user_pages() but it makes sure that all pages come from non-movable zone via additional page migration. The 2nd patch gets around the aio ring pages can't be migrated bug caused by get_user_pages() via using the new function. It only works when configed with CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). Lin Feng (2): mm: hotplug: implement non-movable version of get_user_pages() fs/aio.c: use non-movable version of get_user_pages() to pin ring pages when support memory hotremove fs/aio.c | 6 +++++ include/linux/mm.h | 5 ++++ include/linux/mmzone.h | 4 ++++ mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++ mm/page_isolation.c | 5 ++++ 5 files changed, 83 insertions(+) -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753641Ab3BDKFq (ORCPT ); Mon, 4 Feb 2013 05:05:46 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:63158 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1751590Ab3BDKFf (ORCPT ); Mon, 4 Feb 2013 05:05:35 -0500 X-IronPort-AV: E=Sophos;i="4.84,599,1355068800"; d="scan'208";a="6687006" From: Lin Feng To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng Subject: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Date: Mon, 4 Feb 2013 18:04:08 +0800 Message-Id: <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/04 18:04:17, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/04 18:04:17, Serialize complete at 2013/02/04 18:04:17 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch gets around the aio ring pages can't be migrated bug caused by get_user_pages() via using the new function. It only works as configed with CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). Cc: Benjamin LaHaise Cc: Alexander Viro Cc: Andrew Morton Reviewed-by: Tang Chen Reviewed-by: Gu Zheng Signed-off-by: Lin Feng --- fs/aio.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/aio.c b/fs/aio.c index 71f613c..0e9b30a 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) } dprintk("mmap address: 0x%08lx\n", info->mmap_base); +#ifdef CONFIG_MEMORY_HOTREMOVE + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, + info->mmap_base, nr_pages, + 1, 0, info->ring_pages, NULL); +#else info->nr_pages = get_user_pages(current, ctx->mm, info->mmap_base, nr_pages, 1, 0, info->ring_pages, NULL); +#endif up_write(&ctx->mm->mmap_sem); if (unlikely(info->nr_pages != nr_pages)) { -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753822Ab3BDKGC (ORCPT ); Mon, 4 Feb 2013 05:06:02 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:3871 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1751055Ab3BDKFe (ORCPT ); Mon, 4 Feb 2013 05:05:34 -0500 X-IronPort-AV: E=Sophos;i="4.84,599,1355068800"; d="scan'208";a="6687005" From: Lin Feng To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng Subject: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 4 Feb 2013 18:04:07 +0800 Message-Id: <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/04 18:04:16, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/04 18:04:16, Serialize complete at 2013/02/04 18:04:16 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org get_user_pages() always tries to allocate pages from movable zone, which is not reliable to memory hotremove framework in some case. This patch introduces a new library function called get_user_pages_non_movable() to pin pages only from zone non-movable in memory. It's a wrapper of get_user_pages() but it makes sure that all pages come from non-movable zone via additional page migration. Cc: Andrew Morton Cc: Mel Gorman Cc: KAMEZAWA Hiroyuki Cc: Yasuaki Ishimatsu Reviewed-by: Tang Chen Reviewed-by: Gu Zheng Signed-off-by: Lin Feng --- include/linux/mm.h | 5 ++++ include/linux/mmzone.h | 4 ++++ mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++ mm/page_isolation.c | 5 ++++ 4 files changed, 77 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 66e2f7c..2a25d0e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, struct page **pages, struct vm_area_struct **vmas); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); +#ifdef CONFIG_MEMORY_HOTREMOVE +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, int force, + struct page **pages, struct vm_area_struct **vmas); +#endif struct kvec; int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, struct page **pages); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 73b64a3..5db811e 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) return (idx == ZONE_NORMAL); } +static inline int is_movable(struct zone *zone) +{ + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; +} /** * is_highmem - helper function to quickly check if a struct zone is a * highmem zone or not. This is an attempt to keep references diff --git a/mm/memory.c b/mm/memory.c index bb1369f..e3b8e19 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -58,6 +58,8 @@ #include #include #include +#include +#include #include #include @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, } EXPORT_SYMBOL(get_user_pages); +#ifdef CONFIG_MEMORY_HOTREMOVE +/** + * It's a wrapper of get_user_pages() but it makes sure that all pages come from + * non-movable zone via additional page migration. + */ +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, int force, + struct page **pages, struct vm_area_struct **vmas) +{ + int ret, i, isolate_err, migrate_pre_flag; + LIST_HEAD(pagelist); + +retry: + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, + vmas); + + isolate_err = 0; + migrate_pre_flag = 0; + + for (i = 0; i < ret; i++) { + if (is_movable(page_zone(pages[i]))) { + if (!migrate_pre_flag) { + if (migrate_prep()) + goto put_page; + migrate_pre_flag = 1; + } + + if (!isolate_lru_page(pages[i])) { + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + + page_is_file_cache(pages[i])); + list_add_tail(&pages[i]->lru, &pagelist); + } else { + isolate_err = 1; + goto put_page; + } + } + } + + /* All pages are non movable, we are done :) */ + if (i == ret && list_empty(&pagelist)) + return ret; + +put_page: + /* Undo the effects of former get_user_pages(), we won't pin anything */ + for (i = 0; i < ret; i++) + put_page(pages[i]); + + if (migrate_pre_flag && !isolate_err) { + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, + false, MIGRATE_SYNC, MR_SYSCALL); + /* Steal pages from non-movable zone successfully? */ + if (!ret) + goto retry; + } + + putback_lru_pages(&pagelist); + return 0; +} +EXPORT_SYMBOL(get_user_pages_non_movable); +#endif + /** * get_dump_page() - pin user page in memory while writing it to core dump * @addr: user address diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 383bdbb..1b7bd17 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -247,6 +247,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, return ret ? 0 : -EBUSY; } +/** + * @private: 0 means page can be alloced from movable zone, otherwise forbidden + */ struct page *alloc_migrate_target(struct page *page, unsigned long private, int **resultp) { @@ -254,6 +257,8 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private, if (PageHighMem(page)) gfp_mask |= __GFP_HIGHMEM; + if (unlikely(private != 0)) + gfp_mask &= ~__GFP_MOVABLE; return alloc_page(gfp_mask); } -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756029Ab3BDPTP (ORCPT ); Mon, 4 Feb 2013 10:19:15 -0500 Received: from mx1.redhat.com ([209.132.183.28]:37804 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754900Ab3BDPTO (ORCPT ); Mon, 4 Feb 2013 10:19:14 -0500 From: Jeff Moyer To: Lin Feng Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> X-PGP-KeyID: 1F78E1B4 X-PGP-CertKey: F6FE 280D 8293 F72C 65FD 5A58 1FF8 A7CA 1F78 E1B4 X-PCLoadLetter: What the f**k does that mean? Date: Mon, 04 Feb 2013 10:18:03 -0500 In-Reply-To: <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> (Lin Feng's message of "Mon, 4 Feb 2013 18:04:08 +0800") Message-ID: User-Agent: Gnus/5.110011 (No Gnus v0.11) Emacs/23.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Lin Feng writes: > This patch gets around the aio ring pages can't be migrated bug caused by > get_user_pages() via using the new function. It only works as configed with > CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > > Cc: Benjamin LaHaise > Cc: Alexander Viro > Cc: Andrew Morton > Reviewed-by: Tang Chen > Reviewed-by: Gu Zheng > Signed-off-by: Lin Feng > --- > fs/aio.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/fs/aio.c b/fs/aio.c > index 71f613c..0e9b30a 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) > } > > dprintk("mmap address: 0x%08lx\n", info->mmap_base); > +#ifdef CONFIG_MEMORY_HOTREMOVE > + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, > + info->mmap_base, nr_pages, > + 1, 0, info->ring_pages, NULL); > +#else > info->nr_pages = get_user_pages(current, ctx->mm, > info->mmap_base, nr_pages, > 1, 0, info->ring_pages, NULL); > +#endif Can't you hide this in your 1/1 patch, by providing this function as just a static inline wrapper around get_user_pages when CONFIG_MEMORY_HOTREMOVE is not enabled? Cheers, Jeff From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756212Ab3BDXDU (ORCPT ); Mon, 4 Feb 2013 18:03:20 -0500 Received: from mx1.redhat.com ([209.132.183.28]:58607 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753943Ab3BDXDS (ORCPT ); Mon, 4 Feb 2013 18:03:18 -0500 Date: Mon, 4 Feb 2013 15:02:09 -0800 From: Zach Brown To: Jeff Moyer Cc: Lin Feng , akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Message-ID: <20130204230209.GK14246@lenny.home.zabbo.net> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > > index 71f613c..0e9b30a 100644 > > --- a/fs/aio.c > > +++ b/fs/aio.c > > @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) > > } > > > > dprintk("mmap address: 0x%08lx\n", info->mmap_base); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, > > + info->mmap_base, nr_pages, > > + 1, 0, info->ring_pages, NULL); > > +#else > > info->nr_pages = get_user_pages(current, ctx->mm, > > info->mmap_base, nr_pages, > > 1, 0, info->ring_pages, NULL); > > +#endif > > Can't you hide this in your 1/1 patch, by providing this function as > just a static inline wrapper around get_user_pages when > CONFIG_MEMORY_HOTREMOVE is not enabled? Yes, please. Having callers duplicate the call site for a single optional boolean input is unacceptable. But do we want another input argument as a name? Should aio have been using get_user_pages_fast()? (and so now _fast_non_movable?) I wonder if it's time to offer the booleans as a _flags() variant, much like the current internal flags for __get_user_pages(). The write and force arguments are already booleans, we have a different fast api, and now we're adding non-movable. The NON_MOVABLE flag would be 0 without MEMORY_HOTREMOVE, easy peasy. Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might also be nice :). No? - z From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755651Ab3BEAG2 (ORCPT ); Mon, 4 Feb 2013 19:06:28 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:49605 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754458Ab3BEAG0 (ORCPT ); Mon, 4 Feb 2013 19:06:26 -0500 Date: Mon, 4 Feb 2013 16:06:24 -0800 From: Andrew Morton To: Lin Feng Cc: mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-Id: <20130204160624.5c20a8a0.akpm@linux-foundation.org> In-Reply-To: <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org melreadthis On Mon, 4 Feb 2013 18:04:07 +0800 Lin Feng wrote: > get_user_pages() always tries to allocate pages from movable zone, which is not > reliable to memory hotremove framework in some case. > > This patch introduces a new library function called get_user_pages_non_movable() > to pin pages only from zone non-movable in memory. > It's a wrapper of get_user_pages() but it makes sure that all pages come from > non-movable zone via additional page migration. > > ... > > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > struct page **pages, struct vm_area_struct **vmas); > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > struct page **pages); > +#ifdef CONFIG_MEMORY_HOTREMOVE > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, int nr_pages, int write, int force, > + struct page **pages, struct vm_area_struct **vmas); > +#endif The ifdefs aren't really needed here and I encourage people to omit them. This keeps the header files looking neater and reduces the chances of things later breaking because we forgot to update some CONFIG_foo logic in a header file. The downside is that errors will be revealed at link time rather than at compile time, but that's a pretty small cost. > struct kvec; > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > struct page **pages); > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 73b64a3..5db811e 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > return (idx == ZONE_NORMAL); > } > > +static inline int is_movable(struct zone *zone) > +{ > + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > +} A better name would be zone_is_movable(). We haven't been very consistent about this in mmzone.h, but zone_is_foo() is pretty common. And a neater implementation would be return zone_idx(zone) == ZONE_MOVABLE; All of which made me look at mmzone.h, and what I saw wasn't very nice :( Please review... From: Andrew Morton Subject: include/linux/mmzone.h: cleanups - implement zone_idx() in C to fix its references-args-twice macro bug - use zone_idx() in is_highmem() to remove large amounts of silly fluff. Cc: Lin Feng Cc: Mel Gorman Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups +++ a/include/linux/mmzone.h @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by /* * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. */ -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) +static inline enum zone_type zone_idx(struct zone *zone) +{ + return zone - zone->zone_pgdat->node_zones; +} static inline int populated_zone(struct zone *zone) { @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon static inline int is_highmem(struct zone *zone) { #ifdef CONFIG_HIGHMEM - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || - (zone_off == ZONE_MOVABLE * sizeof(*zone) && - zone_movable_is_highmem()); + enum zone_type idx = zone_idx(zone); + + return idx == ZONE_HIGHMEM || + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); #else return 0; #endif _ > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -58,6 +58,8 @@ > #include > #include > #include > +#include > +#include > #include > > #include > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > } > EXPORT_SYMBOL(get_user_pages); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > +/** > + * It's a wrapper of get_user_pages() but it makes sure that all pages come from > + * non-movable zone via additional page migration. > + */ This needs a description of why the function exists - say something about the requirements of memory hotplug. Also a few words describing how the function works would be good. > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, int nr_pages, int write, int force, > + struct page **pages, struct vm_area_struct **vmas) > +{ > + int ret, i, isolate_err, migrate_pre_flag; > + LIST_HEAD(pagelist); > + > +retry: > + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, > + vmas); We should handle (ret < 0) here. At present the function will silently convert an error return into "return 0", which is bad. The function does appear to otherwise do the right thing if get_user_pages() failed, but only due to good luck. > + isolate_err = 0; > + migrate_pre_flag = 0; > + > + for (i = 0; i < ret; i++) { > + if (is_movable(page_zone(pages[i]))) { > + if (!migrate_pre_flag) { > + if (migrate_prep()) > + goto put_page; > + migrate_pre_flag = 1; > + } > + > + if (!isolate_lru_page(pages[i])) { > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > + page_is_file_cache(pages[i])); > + list_add_tail(&pages[i]->lru, &pagelist); > + } else { > + isolate_err = 1; > + goto put_page; > + } > + } > + } > + > + /* All pages are non movable, we are done :) */ > + if (i == ret && list_empty(&pagelist)) > + return ret; > + > +put_page: > + /* Undo the effects of former get_user_pages(), we won't pin anything */ > + for (i = 0; i < ret; i++) > + put_page(pages[i]); > + > + if (migrate_pre_flag && !isolate_err) { > + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > + false, MIGRATE_SYNC, MR_SYSCALL); > + /* Steal pages from non-movable zone successfully? */ > + if (!ret) > + goto retry; This is buggy. migrate_pages() doesn't empty its `from' argument, so page_list must be reinitialised here (or, better, at the start of the loop). Mel, what's up with migrate_pages()? Shouldn't it be removing the pages from the list when MIGRATEPAGE_SUCCESS? The use of list_for_each_entry_safe() suggests we used to do that... > + } > + > + putback_lru_pages(&pagelist); > + return 0; > +} > +EXPORT_SYMBOL(get_user_pages_non_movable); > +#endif > + Generally, I'd like Mel to go through (the next version of) this patch carefully, please. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756223Ab3BEASv (ORCPT ); Mon, 4 Feb 2013 19:18:51 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:49764 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755968Ab3BEASu (ORCPT ); Mon, 4 Feb 2013 19:18:50 -0500 Date: Mon, 4 Feb 2013 16:18:48 -0800 From: Andrew Morton To: Lin Feng , mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-Id: <20130204161848.bfd36176.akpm@linux-foundation.org> In-Reply-To: <20130204160624.5c20a8a0.akpm@linux-foundation.org> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Also... On Mon, 4 Feb 2013 16:06:24 -0800 Andrew Morton wrote: > > +put_page: > > + /* Undo the effects of former get_user_pages(), we won't pin anything */ > > + for (i = 0; i < ret; i++) > > + put_page(pages[i]); We can use release_pages() here. release_pages() is designed to be more efficient when we're putting the final reference to (most of) the pages. It probably has little if any benefit when putting still-in-use pages, as we're doing here. But please consider... From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755498Ab3BEA7E (ORCPT ); Mon, 4 Feb 2013 19:59:04 -0500 Received: from LGEMRELSE6Q.lge.com ([156.147.1.121]:48612 "EHLO LGEMRELSE6Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751215Ab3BEA7C (ORCPT ); Mon, 4 Feb 2013 19:59:02 -0500 X-AuditID: 9c930179-b7c24ae00000119c-02-51105953d620 Date: Tue, 5 Feb 2013 09:58:59 +0900 From: Minchan Kim To: Lin Feng Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Message-ID: <20130205005859.GE2610@blaptop> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > Currently get_user_pages() always tries to allocate pages from movable zone, > as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > of get_user_pages() is easy to pin user pages for a long time(for now we found > that pages pinned as aio ring pages is such case), which is fatal for memory > hotplug/remove framework. > > So the 1st patch introduces a new library function called > get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > It's a wrapper of get_user_pages() but it makes sure that all pages come from > non-movable zone via additional page migration. > > The 2nd patch gets around the aio ring pages can't be migrated bug caused by > get_user_pages() via using the new function. It only works when configed with > CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). CMA has same issue but the problem is the driver developers or any subsystem using GUP can't know their pages is in CMA area or not in advance. So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? Even some driver module in embedded side doesn't open their source code. I would like to make GUP smart so it allocates a page from non-movable/non-cma area when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :( > > Lin Feng (2): > mm: hotplug: implement non-movable version of get_user_pages() > fs/aio.c: use non-movable version of get_user_pages() to pin ring > pages when support memory hotremove > > fs/aio.c | 6 +++++ > include/linux/mm.h | 5 ++++ > include/linux/mmzone.h | 4 ++++ > mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++ > mm/page_isolation.c | 5 ++++ > 5 files changed, 83 insertions(+) > > -- > 1.7.11.7 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756103Ab3BEDKz (ORCPT ); Mon, 4 Feb 2013 22:10:55 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:53718 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1754376Ab3BEDKy (ORCPT ); Mon, 4 Feb 2013 22:10:54 -0500 X-IronPort-AV: E=Sophos;i="4.84,603,1355068800"; d="scan'208";a="6689310" Message-ID: <511077E7.4040605@cn.fujitsu.com> Date: Tue, 05 Feb 2013 11:09:27 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Andrew Morton CC: mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Tang chen , Gu Zheng Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> In-Reply-To: <20130204160624.5c20a8a0.akpm@linux-foundation.org> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 11:09:32, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 11:09:34, Serialize complete at 2013/02/05 11:09:34 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Andrew, On 02/05/2013 08:06 AM, Andrew Morton wrote: > > melreadthis > > On Mon, 4 Feb 2013 18:04:07 +0800 > Lin Feng wrote: > >> get_user_pages() always tries to allocate pages from movable zone, which is not >> reliable to memory hotremove framework in some case. >> >> This patch introduces a new library function called get_user_pages_non_movable() >> to pin pages only from zone non-movable in memory. >> It's a wrapper of get_user_pages() but it makes sure that all pages come from >> non-movable zone via additional page migration. >> >> ... >> >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, >> struct page **pages, struct vm_area_struct **vmas); >> int get_user_pages_fast(unsigned long start, int nr_pages, int write, >> struct page **pages); >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, >> + unsigned long start, int nr_pages, int write, int force, >> + struct page **pages, struct vm_area_struct **vmas); >> +#endif > > The ifdefs aren't really needed here and I encourage people to omit > them. This keeps the header files looking neater and reduces the > chances of things later breaking because we forgot to update some > CONFIG_foo logic in a header file. The downside is that errors will be > revealed at link time rather than at compile time, but that's a pretty > small cost. OK, got it, thanks :) > >> struct kvec; >> int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, >> struct page **pages); >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >> index 73b64a3..5db811e 100644 >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) >> return (idx == ZONE_NORMAL); >> } >> >> +static inline int is_movable(struct zone *zone) >> +{ >> + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; >> +} > > A better name would be zone_is_movable(). We haven't been very > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > Yes, zone_is_xxx() should be a better name, but there are some analogous definition like is_dma32() and is_normal() etc, if we only use zone_is_movable() for movable zone it will break such naming rules. Should we update other ones in a separate patch later or just keep the old style? > And a neater implementation would be > > return zone_idx(zone) == ZONE_MOVABLE; > OK. After your change, should we also update for other ones such as is_normal()..? > All of which made me look at mmzone.h, and what I saw wasn't very nice :( > > Please review... > > > From: Andrew Morton > Subject: include/linux/mmzone.h: cleanups > > - implement zone_idx() in C to fix its references-args-twice macro bug > > - use zone_idx() in is_highmem() to remove large amounts of silly fluff. > > Cc: Lin Feng > Cc: Mel Gorman > Signed-off-by: Andrew Morton > --- > > include/linux/mmzone.h | 13 ++++++++----- > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h > --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups > +++ a/include/linux/mmzone.h > @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by > /* > * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. > */ > -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > +static inline enum zone_type zone_idx(struct zone *zone) > +{ > + return zone - zone->zone_pgdat->node_zones; > +} > > static inline int populated_zone(struct zone *zone) > { > @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon > static inline int is_highmem(struct zone *zone) > { > #ifdef CONFIG_HIGHMEM > - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || > - (zone_off == ZONE_MOVABLE * sizeof(*zone) && > - zone_movable_is_highmem()); > + enum zone_type idx = zone_idx(zone); > + > + return idx == ZONE_HIGHMEM || > + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); > #else > return 0; > #endif > _ > > >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -58,6 +58,8 @@ >> #include >> #include >> #include >> +#include >> +#include >> #include >> >> #include >> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, >> } >> EXPORT_SYMBOL(get_user_pages); >> >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> +/** >> + * It's a wrapper of get_user_pages() but it makes sure that all pages come from >> + * non-movable zone via additional page migration. >> + */ > > This needs a description of why the function exists - say something > about the requirements of memory hotplug. > > Also a few words describing how the function works would be good. OK, I will add them in next version. > >> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, >> + unsigned long start, int nr_pages, int write, int force, >> + struct page **pages, struct vm_area_struct **vmas) >> +{ >> + int ret, i, isolate_err, migrate_pre_flag; >> + LIST_HEAD(pagelist); >> + >> +retry: >> + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, >> + vmas); > > We should handle (ret < 0) here. At present the function will silently > convert an error return into "return 0", which is bad. The function > does appear to otherwise do the right thing if get_user_pages() failed, > but only due to good luck. Sorry, I forgot this, thanks. > >> + isolate_err = 0; >> + migrate_pre_flag = 0; >> + >> + for (i = 0; i < ret; i++) { >> + if (is_movable(page_zone(pages[i]))) { >> + if (!migrate_pre_flag) { >> + if (migrate_prep()) >> + goto put_page; >> + migrate_pre_flag = 1; >> + } >> + >> + if (!isolate_lru_page(pages[i])) { >> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >> + page_is_file_cache(pages[i])); >> + list_add_tail(&pages[i]->lru, &pagelist); >> + } else { >> + isolate_err = 1; >> + goto put_page; >> + } >> + } >> + } >> + >> + /* All pages are non movable, we are done :) */ >> + if (i == ret && list_empty(&pagelist)) >> + return ret; >> + >> +put_page: >> + /* Undo the effects of former get_user_pages(), we won't pin anything */ >> + for (i = 0; i < ret; i++) >> + put_page(pages[i]); >> + >> + if (migrate_pre_flag && !isolate_err) { >> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >> + false, MIGRATE_SYNC, MR_SYSCALL); >> + /* Steal pages from non-movable zone successfully? */ >> + if (!ret) >> + goto retry; > > This is buggy. migrate_pages() doesn't empty its `from' argument, so > page_list must be reinitialised here (or, better, at the start of the loop). migrate_pages() list_for_each_entry_safe() unmap_and_move() if (rc != -EAGAIN) { list_del(&page->lru); } IUUC, the page migrated successfully will be remove from the `from` list here. So if all pages having been migrated successlly, the list will be empty, and we goto retry. Otherwise the rest of the immigrated pages will be handled later by putback_lru_pages(&pagelist), and the function just return. Also there are many places using such logic: 1274 nr_failed = migrate_pages(&pagelist, new_vma_page, 1275 (unsigned long)vma, 1276 false, MIGRATE_SYNC, 1277 MR_MEMPOLICY_MBIND); 1278 if (nr_failed) 1279 putback_lru_pages(&pagelist); Since we traverse the list and handle each page separately, It's likely that we have migrated some page successfully before we encounter a failure. If the pagelist is still carrying the successfully migrated pages when migrate_pages() return, such code is bugyy. > > Mel, what's up with migrate_pages()? Shouldn't it be removing the > pages from the list when MIGRATEPAGE_SUCCESS? The use of > list_for_each_entry_safe() suggests we used to do that... > >> + } >> + >> + putback_lru_pages(&pagelist); >> + return 0; >> +} >> +EXPORT_SYMBOL(get_user_pages_non_movable); >> +#endif >> + > > Generally, I'd like Mel to go through (the next version of) this patch > carefully, please. > > On 02/05/2013 08:18 AM, Andrew Morton wrote:> On Mon, 4 Feb 2013 16:06:24 -0800 > Andrew Morton wrote: > >>> > > +put_page: >>> > > + /* Undo the effects of former get_user_pages(), we won't pin anything */ >>> > > + for (i = 0; i < ret; i++) >>> > > + put_page(pages[i]); > We can use release_pages() here. > > release_pages() is designed to be more efficient when we're putting the > final reference to (most of) the pages. It probably has little if any > benefit when putting still-in-use pages, as we're doing here. > > But please consider... OK, I will try :) > thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756159Ab3BEEoQ (ORCPT ); Mon, 4 Feb 2013 23:44:16 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:17568 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1753464Ab3BEEoP (ORCPT ); Mon, 4 Feb 2013 23:44:15 -0500 X-IronPort-AV: E=Sophos;i="4.84,603,1355068800"; d="scan'208";a="6689739" Message-ID: <51108DC8.4090704@cn.fujitsu.com> Date: Tue, 05 Feb 2013 12:42:48 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Minchan Kim CC: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> In-Reply-To: <20130205005859.GE2610@blaptop> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 12:42:52, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 12:42:54, Serialize complete at 2013/02/05 12:42:54 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Minchan, On 02/05/2013 08:58 AM, Minchan Kim wrote: > Hello, > > On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: >> Currently get_user_pages() always tries to allocate pages from movable zone, >> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users >> of get_user_pages() is easy to pin user pages for a long time(for now we found >> that pages pinned as aio ring pages is such case), which is fatal for memory >> hotplug/remove framework. >> >> So the 1st patch introduces a new library function called >> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. >> It's a wrapper of get_user_pages() but it makes sure that all pages come from >> non-movable zone via additional page migration. >> >> The 2nd patch gets around the aio ring pages can't be migrated bug caused by >> get_user_pages() via using the new function. It only works when configed with >> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > > CMA has same issue but the problem is the driver developers or any subsystem > using GUP can't know their pages is in CMA area or not in advance. > So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? > Even some driver module in embedded side doesn't open their source code. Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users of GUP, they will release the pinned pages immediately and to such users they should get a good performance, using the old style interface is a smart way. And we had better just deal with the cases we have to by using the new interface. > > I would like to make GUP smart so it allocates a page from non-movable/non-cma area > when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP > performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :( As I debuged the get_user_pages(), I found that some pages is already there and may be allocated before we call get_user_pages(). __get_user_pages() have following logic to handle such case. 1786 while (!(page = follow_page(vma, start, foll_flags))) { 1787 int ret; To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP as smart as we want :( , so I worked out the migration approach to get around and avoid messing up the current code :) thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750883Ab3BEFHz (ORCPT ); Tue, 5 Feb 2013 00:07:55 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:43954 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1750759Ab3BEFHy (ORCPT ); Tue, 5 Feb 2013 00:07:54 -0500 X-IronPort-AV: E=Sophos;i="4.84,603,1355068800"; d="scan'208";a="6690059" Message-ID: <51109352.9070401@cn.fujitsu.com> Date: Tue, 05 Feb 2013 13:06:26 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Jeff Moyer CC: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Tang chen , Gu Zheng Subject: Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> In-Reply-To: X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 13:06:30, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 13:06:34, Serialize complete at 2013/02/05 13:06:34 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Jeff, On 02/04/2013 11:18 PM, Jeff Moyer wrote: >> --- >> fs/aio.c | 6 ++++++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/fs/aio.c b/fs/aio.c >> index 71f613c..0e9b30a 100644 >> --- a/fs/aio.c >> +++ b/fs/aio.c >> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) >> } >> >> dprintk("mmap address: 0x%08lx\n", info->mmap_base); >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, >> + info->mmap_base, nr_pages, >> + 1, 0, info->ring_pages, NULL); >> +#else >> info->nr_pages = get_user_pages(current, ctx->mm, >> info->mmap_base, nr_pages, >> 1, 0, info->ring_pages, NULL); >> +#endif > > Can't you hide this in your 1/1 patch, by providing this function as > just a static inline wrapper around get_user_pages when > CONFIG_MEMORY_HOTREMOVE is not enabled? Good idea, it makes the callers more neatly :) thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751156Ab3BEFZV (ORCPT ); Tue, 5 Feb 2013 00:25:21 -0500 Received: from LGEMRELSE1Q.lge.com ([156.147.1.111]:56969 "EHLO LGEMRELSE1Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750786Ab3BEFZT (ORCPT ); Tue, 5 Feb 2013 00:25:19 -0500 X-AuditID: 9c93016f-b7b70ae000000e36-df-511097bddd3f Date: Tue, 5 Feb 2013 14:25:17 +0900 From: Minchan Kim To: Lin Feng Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Message-ID: <20130205052517.GH2610@blaptop> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> <51108DC8.4090704@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51108DC8.4090704@cn.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Lin, On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote: > Hi Minchan, > > On 02/05/2013 08:58 AM, Minchan Kim wrote: > > Hello, > > > > On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > >> Currently get_user_pages() always tries to allocate pages from movable zone, > >> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > >> of get_user_pages() is easy to pin user pages for a long time(for now we found > >> that pages pinned as aio ring pages is such case), which is fatal for memory > >> hotplug/remove framework. > >> > >> So the 1st patch introduces a new library function called > >> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > >> It's a wrapper of get_user_pages() but it makes sure that all pages come from > >> non-movable zone via additional page migration. > >> > >> The 2nd patch gets around the aio ring pages can't be migrated bug caused by > >> get_user_pages() via using the new function. It only works when configed with > >> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > > > > CMA has same issue but the problem is the driver developers or any subsystem > > using GUP can't know their pages is in CMA area or not in advance. > > So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? > > Even some driver module in embedded side doesn't open their source code. > Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users > of GUP, they will release the pinned pages immediately and to such users they should get > a good performance, using the old style interface is a smart way. And we had better just > deal with the cases we have to by using the new interface. Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know in advance they will use pinned pages long time or release in short time because it depends on some event like user's response which is very not predetermined. So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of failing migration by page count and then, investigate they are really using GUP and it's REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? Yes. it could be done but it would be rather trobulesome job. > > > > > I would like to make GUP smart so it allocates a page from non-movable/non-cma area > > when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP > > performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :( > As I debuged the get_user_pages(), I found that some pages is already there and may be > allocated before we call get_user_pages(). __get_user_pages() have following logic to > handle such case. > 1786 while (!(page = follow_page(vma, start, foll_flags))) { > 1787 int ret; > To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP > as smart as we want :( , so I worked out the migration approach to get around and > avoid messing up the current code :) I didn't look at your code in detail yet but What's wrong following this? Change curret GUP with old_GUP. #ifdef CONFIG_MIGRATE_ISOLATE int get_user_pages_non_movable() { .. old_get_user_pages() .. } int get_user_pages() { return get_user_pages_non_movable(); } #else int get_user_pages() { return old_get_user_pages() } #endif > > thanks, > linfeng > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751667Ab3BEFhU (ORCPT ); Tue, 5 Feb 2013 00:37:20 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:34855 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1751072Ab3BEFhR (ORCPT ); Tue, 5 Feb 2013 00:37:17 -0500 X-IronPort-AV: E=Sophos;i="4.84,603,1355068800"; d="scan'208";a="6690185" Message-ID: <51109A38.7050605@cn.fujitsu.com> Date: Tue, 05 Feb 2013 13:35:52 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Zach Brown CC: Jeff Moyer , akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Tang chen , Gu Zheng Subject: Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-3-git-send-email-linfeng@cn.fujitsu.com> <20130204230209.GK14246@lenny.home.zabbo.net> In-Reply-To: <20130204230209.GK14246@lenny.home.zabbo.net> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 13:35:57, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 13:35:58, Serialize complete at 2013/02/05 13:35:58 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Zach, On 02/05/2013 07:02 AM, Zach Brown wrote: >>> index 71f613c..0e9b30a 100644 >>> --- a/fs/aio.c >>> +++ b/fs/aio.c >>> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) >>> } >>> >>> dprintk("mmap address: 0x%08lx\n", info->mmap_base); >>> +#ifdef CONFIG_MEMORY_HOTREMOVE >>> + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, >>> + info->mmap_base, nr_pages, >>> + 1, 0, info->ring_pages, NULL); >>> +#else >>> info->nr_pages = get_user_pages(current, ctx->mm, >>> info->mmap_base, nr_pages, >>> 1, 0, info->ring_pages, NULL); >>> +#endif >> >> Can't you hide this in your 1/1 patch, by providing this function as >> just a static inline wrapper around get_user_pages when >> CONFIG_MEMORY_HOTREMOVE is not enabled? > > Yes, please. Having callers duplicate the call site for a single > optional boolean input is unacceptable. I will deal with it in next version :) > > But do we want another input argument as a name? Should aio have been > using get_user_pages_fast()? (and so now _fast_non_movable?) > > I wonder if it's time to offer the booleans as a _flags() variant, much > like the current internal flags for __get_user_pages(). The write and > force arguments are already booleans, we have a different fast api, and > now we're adding non-movable. The NON_MOVABLE flag would be 0 without > MEMORY_HOTREMOVE, easy peasy. As my next reply-mail mentioned, IIUC in GUP case additional flags seems doesn't work, I abstract here: As I debuged the get_user_pages(), I found that some pages is already there and may be allocated before we call get_user_pages(). __get_user_pages() have following logic to handle such case. 1786 while (!(page = follow_page(vma, start, foll_flags))) { 1787 int ret; To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP as smart as we want , so I worked out the migration approach to get around and avoid messing up the current code. And even worse we have already got *8* arguments...Maybe we have to rework the boolean arguments into bit flags... It seems not a little work :( > > Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might > also be nice :). Agree, maybe we could handle them later :) thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751148Ab3BEGfQ (ORCPT ); Tue, 5 Feb 2013 01:35:16 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:3961 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1750747Ab3BEGfN (ORCPT ); Tue, 5 Feb 2013 01:35:13 -0500 X-IronPort-AV: E=Sophos;i="4.84,603,1355068800"; d="scan'208";a="6690500" Message-ID: <5110A442.5000707@cn.fujitsu.com> Date: Tue, 05 Feb 2013 14:18:42 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Minchan Kim CC: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> <51108DC8.4090704@cn.fujitsu.com> <20130205052517.GH2610@blaptop> In-Reply-To: <20130205052517.GH2610@blaptop> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 14:18:46, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 14:33:53, Serialize complete at 2013/02/05 14:33:53 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/05/2013 01:25 PM, Minchan Kim wrote: > Hi Lin, > > On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote: >> Hi Minchan, >> >> On 02/05/2013 08:58 AM, Minchan Kim wrote: >>> Hello, >>> >>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: >>>> Currently get_user_pages() always tries to allocate pages from movable zone, >>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users >>>> of get_user_pages() is easy to pin user pages for a long time(for now we found >>>> that pages pinned as aio ring pages is such case), which is fatal for memory >>>> hotplug/remove framework. >>>> >>>> So the 1st patch introduces a new library function called >>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. >>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from >>>> non-movable zone via additional page migration. >>>> >>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by >>>> get_user_pages() via using the new function. It only works when configed with >>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). >>> >>> CMA has same issue but the problem is the driver developers or any subsystem >>> using GUP can't know their pages is in CMA area or not in advance. >>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? >>> Even some driver module in embedded side doesn't open their source code. >> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users >> of GUP, they will release the pinned pages immediately and to such users they should get >> a good performance, using the old style interface is a smart way. And we had better just >> deal with the cases we have to by using the new interface. > > Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages > immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power > by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know > in advance they will use pinned pages long time or release in short time because it depends > on some event like user's response which is very not predetermined. > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of > failing migration by page count and then, investigate they are really using GUP and it's > REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? > > Yes. it could be done but it would be rather trobulesome job. Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is a big job. If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the *non_movable* version of GUP. But in some server environment we want to keep the performance but also want to use hotremove feature in case. Maybe patch the place as we need is a trade off for such support. Mel also said in the last discussion: On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote: >> Well, that's a fairly low-level implementation detail. A more typical >> approach would be to add a new get_user_pages_non_movable() or such. >> That would probably have the same signature as get_user_pages(), with >> one additional argument. Then get_user_pages() becomes a one-line >> wrapper which passes in a particular value of that argument. >> > > That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE > allocations. That will impact THP availability by increasing the number > of MIGRATE_UNMOVABLE blocks that exist and it would hit every user -- > not just those that care about ZONE_MOVABLE. > > I'm likely to NAK such a patch if it's only about node hot-remove because > it's much more of a corner case than wanting to use THP. > > I would prefer if get_user_pages() checked if the page it was about to > pin was in ZONE_MOVABLE and if so, migrate it at that point before it's > pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability > if that's what they want. The CMA people might also want to take > advantage of this if the page happened to be in the MIGRATE_CMA > pageblock. > So it may not a good idea that we all fall into calling the *non_movable* version of GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think? > #ifdef CONFIG_MIGRATE_ISOLATE > > int get_user_pages_non_movable() > { > .. > old_get_user_pages() > .. > } > > int get_user_pages() > { > return get_user_pages_non_movable(); > } > #else > int get_user_pages() > { > return old_get_user_pages() > } > #endif thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754055Ab3BEHpW (ORCPT ); Tue, 5 Feb 2013 02:45:22 -0500 Received: from LGEMRELSE1Q.lge.com ([156.147.1.111]:50910 "EHLO LGEMRELSE1Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751147Ab3BEHpV (ORCPT ); Tue, 5 Feb 2013 02:45:21 -0500 X-AuditID: 9c93016f-b7b70ae000000e36-a7-5110b88ff586 Date: Tue, 5 Feb 2013 16:45:19 +0900 From: Minchan Kim To: Lin Feng Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Message-ID: <20130205074519.GB11197@blaptop> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> <51108DC8.4090704@cn.fujitsu.com> <20130205052517.GH2610@blaptop> <5110A442.5000707@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5110A442.5000707@cn.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 05, 2013 at 02:18:42PM +0800, Lin Feng wrote: > > > On 02/05/2013 01:25 PM, Minchan Kim wrote: > > Hi Lin, > > > > On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote: > >> Hi Minchan, > >> > >> On 02/05/2013 08:58 AM, Minchan Kim wrote: > >>> Hello, > >>> > >>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > >>>> Currently get_user_pages() always tries to allocate pages from movable zone, > >>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > >>>> of get_user_pages() is easy to pin user pages for a long time(for now we found > >>>> that pages pinned as aio ring pages is such case), which is fatal for memory > >>>> hotplug/remove framework. > >>>> > >>>> So the 1st patch introduces a new library function called > >>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > >>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from > >>>> non-movable zone via additional page migration. > >>>> > >>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by > >>>> get_user_pages() via using the new function. It only works when configed with > >>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > >>> > >>> CMA has same issue but the problem is the driver developers or any subsystem > >>> using GUP can't know their pages is in CMA area or not in advance. > >>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? > >>> Even some driver module in embedded side doesn't open their source code. > >> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users > >> of GUP, they will release the pinned pages immediately and to such users they should get > >> a good performance, using the old style interface is a smart way. And we had better just > >> deal with the cases we have to by using the new interface. > > > > Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages > > immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power > > by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know > > in advance they will use pinned pages long time or release in short time because it depends > > on some event like user's response which is very not predetermined. > > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of > > failing migration by page count and then, investigate they are really using GUP and it's > > REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? > > > > Yes. it could be done but it would be rather trobulesome job. > Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is > a big job. > If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the > *non_movable* version of GUP. > But in some server environment we want to keep the performance but also want to use hotremove > feature in case. Maybe patch the place as we need is a trade off for such support. > > Mel also said in the last discussion: > > On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote: > >> Well, that's a fairly low-level implementation detail. A more typical > >> approach would be to add a new get_user_pages_non_movable() or such. > >> That would probably have the same signature as get_user_pages(), with > >> one additional argument. Then get_user_pages() becomes a one-line > >> wrapper which passes in a particular value of that argument. > >> > > > > That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE > > allocations. That will impact THP availability by increasing the number > > of MIGRATE_UNMOVABLE blocks that exist and it would hit every user -- > > not just those that care about ZONE_MOVABLE. > > > > I'm likely to NAK such a patch if it's only about node hot-remove because > > it's much more of a corner case than wanting to use THP. > > > > I would prefer if get_user_pages() checked if the page it was about to > > pin was in ZONE_MOVABLE and if so, migrate it at that point before it's > > pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability > > if that's what they want. The CMA people might also want to take > > advantage of this if the page happened to be in the MIGRATE_CMA > > pageblock. > > > > So it may not a good idea that we all fall into calling the *non_movable* version of > GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think? Frankly speaking, I can't understand Mel's comment. AFAIUC, he said GUP checks the page before get_page and if the page is movable zone, then migrate it out of movable zone and get_page again. That's exactly what I want. It doesn't introduce GUP_NM. -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753872Ab3BEJA0 (ORCPT ); Tue, 5 Feb 2013 04:00:26 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:37699 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1751907Ab3BEJAZ (ORCPT ); Tue, 5 Feb 2013 04:00:25 -0500 X-IronPort-AV: E=Sophos;i="4.84,603,1355068800"; d="scan'208";a="6691571" Message-ID: <5110C28D.1040001@cn.fujitsu.com> Date: Tue, 05 Feb 2013 16:27:57 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Minchan Kim CC: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <20130205005859.GE2610@blaptop> <51108DC8.4090704@cn.fujitsu.com> <20130205052517.GH2610@blaptop> <5110A442.5000707@cn.fujitsu.com> <20130205074519.GB11197@blaptop> In-Reply-To: <20130205074519.GB11197@blaptop> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 16:28:01, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 16:28:08, Serialize complete at 2013/02/05 16:28:08 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Minchan, On 02/05/2013 03:45 PM, Minchan Kim wrote: >> So it may not a good idea that we all fall into calling the *non_movable* version of >> > GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think? > Frankly speaking, I can't understand Mel's comment. > AFAIUC, he said GUP checks the page before get_page and if the page is movable zone, > then migrate it out of movable zone and get_page again. > That's exactly what I want. It doesn't introduce GUP_NM. Since an long time pin or not is an unpredictable behave except you know what the caller wants to do. We have to check every time we call GUP, and GUP may need another parameter to teach itself to make the right decision? We have already got *8* parameters :( thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754175Ab3BEL5d (ORCPT ); Tue, 5 Feb 2013 06:57:33 -0500 Received: from cantor2.suse.de ([195.135.220.15]:46224 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751211Ab3BEL5a (ORCPT ); Tue, 5 Feb 2013 06:57:30 -0500 Date: Tue, 5 Feb 2013 11:57:22 +0000 From: Mel Gorman To: Andrew Morton Cc: Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130205115722.GF21389@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20130204160624.5c20a8a0.akpm@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote: > On Mon, 4 Feb 2013 18:04:07 +0800 > Lin Feng wrote: > > > get_user_pages() always tries to allocate pages from movable zone, which is not > > reliable to memory hotremove framework in some case. > > > > This patch introduces a new library function called get_user_pages_non_movable() > > to pin pages only from zone non-movable in memory. > > It's a wrapper of get_user_pages() but it makes sure that all pages come from > > non-movable zone via additional page migration. > > > > ... > > > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > > struct page **pages, struct vm_area_struct **vmas); > > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > > struct page **pages); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > > + unsigned long start, int nr_pages, int write, int force, > > + struct page **pages, struct vm_area_struct **vmas); > > +#endif > > The ifdefs aren't really needed here and I encourage people to omit > them. This keeps the header files looking neater and reduces the > chances of things later breaking because we forgot to update some > CONFIG_foo logic in a header file. The downside is that errors will be > revealed at link time rather than at compile time, but that's a pretty > small cost. > As an aside, if ifdefs *have* to be used then it often better to have a pattern like #ifdef CONFIG_MEMORY_HOTREMOVE int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int nr_pages, int write, int force, struct page **pages, struct vm_area_struct **vmas); #else static inline get_user_pages_non_movable(...) { get_user_pages(...) } #endif It eliminates the need for #ifdefs in the C file that calls get_user_pages_non_movable(). > > struct kvec; > > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > > struct page **pages); > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 73b64a3..5db811e 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > > return (idx == ZONE_NORMAL); > > } > > > > +static inline int is_movable(struct zone *zone) > > +{ > > + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > > +} > > A better name would be zone_is_movable(). He is matching the names of the is_normal(), is_dma32() and is_dma() helpers. Unfortunately I would expect a name like is_movable() to return true if the page had either been allocated with __GFP_MOVABLE or if it was checking migrate can currently handle the page. is_movable() is indeed a terrible name. They should be renamed or deleted in preparation -- see below. > We haven't been very > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > > And a neater implementation would be > > return zone_idx(zone) == ZONE_MOVABLE; > Yes. > All of which made me look at mmzone.h, and what I saw wasn't very nice :( > > Please review... > > > From: Andrew Morton > Subject: include/linux/mmzone.h: cleanups > > - implement zone_idx() in C to fix its references-args-twice macro bug > > - use zone_idx() in is_highmem() to remove large amounts of silly fluff. > > Cc: Lin Feng > Cc: Mel Gorman > Signed-off-by: Andrew Morton > --- > > include/linux/mmzone.h | 13 ++++++++----- > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h > --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups > +++ a/include/linux/mmzone.h > @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by > /* > * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. > */ > -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > +static inline enum zone_type zone_idx(struct zone *zone) > +{ > + return zone - zone->zone_pgdat->node_zones; > +} > > static inline int populated_zone(struct zone *zone) > { > @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon > static inline int is_highmem(struct zone *zone) > { > #ifdef CONFIG_HIGHMEM > - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || > - (zone_off == ZONE_MOVABLE * sizeof(*zone) && > - zone_movable_is_highmem()); > + enum zone_type idx = zone_idx(zone); > + > + return idx == ZONE_HIGHMEM || > + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); > #else > return 0; > #endif *blinks*. Ok, while I think your version looks nicer, it is reverting ddc81ed2 (remove sparse warning for mmzone.h). According to my version of gcc at least, your patch reintroduces the sar and sparse complains if run as make ARCH=i386 C=2 CF=-Wsparse-all .... this? From: Mel Gorman Subject: [PATCH] include/linux/mmzone.h: cleanups - implement zone_idx() in C to fix its references-args-twice macro bug - implement zone_is_idx in an attempt to explain what the fluff in is_highmem is for - rename is_highmem to zone_is_highmem as preparation for zone_is_movable() - implement zone_is_movable as a helper for places that care about ZONE_MOVABLE - Remove is_normal, is_dma32 and is_dma because apparently no one cares - convert users of zone_idx(zone) == ZONE_FOO to appropriate zone_is_foo() helper - Use is_highmem_idx() instead of is_highmem for the PageHighMem implementation. Should be functionally equivalent because we only care about the zones index, not exactly what zone it is [akpm@linux-foundation.org: zone_idx() reimplementation] Signed-off-by: Andrew Morton Signed-off-by: Mel Gorman --- arch/s390/mm/init.c | 2 +- arch/tile/mm/init.c | 5 +---- arch/x86/mm/highmem_32.c | 2 +- include/linux/mmzone.h | 49 +++++++++++++++----------------------------- include/linux/page-flags.h | 2 +- kernel/power/snapshot.c | 12 +++++------ mm/page_alloc.c | 6 +++--- 7 files changed, 30 insertions(+), 48 deletions(-) diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c index 81e596c..c26c321 100644 --- a/arch/s390/mm/init.c +++ b/arch/s390/mm/init.c @@ -231,7 +231,7 @@ int arch_add_memory(int nid, u64 start, u64 size) if (rc) return rc; for_each_zone(zone) { - if (zone_idx(zone) != ZONE_MOVABLE) { + if (!zone_is_movable(zone)) { /* Add range within existing zone limits */ zone_start_pfn = zone->zone_start_pfn; zone_end_pfn = zone->zone_start_pfn + diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c index ef29d6c..8287749 100644 --- a/arch/tile/mm/init.c +++ b/arch/tile/mm/init.c @@ -733,9 +733,6 @@ static void __init set_non_bootmem_pages_init(void) for_each_zone(z) { unsigned long start, end; int nid = z->zone_pgdat->node_id; -#ifdef CONFIG_HIGHMEM - int idx = zone_idx(z); -#endif start = z->zone_start_pfn; end = start + z->spanned_pages; @@ -743,7 +740,7 @@ static void __init set_non_bootmem_pages_init(void) start = max(start, max_low_pfn); #ifdef CONFIG_HIGHMEM - if (idx == ZONE_HIGHMEM) + if (zone_is_highmem(z)) totalhigh_pages += z->spanned_pages; #endif if (kdata_huge) { diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c index 6f31ee5..63020e7 100644 --- a/arch/x86/mm/highmem_32.c +++ b/arch/x86/mm/highmem_32.c @@ -124,7 +124,7 @@ void __init set_highmem_pages_init(void) for_each_zone(zone) { unsigned long zone_start_pfn, zone_end_pfn; - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) continue; zone_start_pfn = zone->zone_start_pfn; diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a23923b..5ecef75 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -779,10 +779,20 @@ static inline int local_memory_node(int node_id) { return node_id; }; unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long); #endif +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) +{ + /* This mess avoids a potentially expensive pointer subtraction. */ + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; + return zone_off == idx * sizeof(*zone); +} + /* * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. */ -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) +static inline enum zone_type zone_idx(struct zone *zone) +{ + return zone - zone->zone_pgdat->node_zones; +} static inline int populated_zone(struct zone *zone) { @@ -810,50 +820,25 @@ static inline int is_highmem_idx(enum zone_type idx) #endif } -static inline int is_normal_idx(enum zone_type idx) -{ - return (idx == ZONE_NORMAL); -} - /** - * is_highmem - helper function to quickly check if a struct zone is a + * zone_is_highmem - helper function to quickly check if a struct zone is a * highmem zone or not. This is an attempt to keep references * to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum. * @zone - pointer to struct zone variable */ -static inline int is_highmem(struct zone *zone) +static inline bool zone_is_highmem(struct zone *zone) { #ifdef CONFIG_HIGHMEM - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || - (zone_off == ZONE_MOVABLE * sizeof(*zone) && - zone_movable_is_highmem()); -#else - return 0; -#endif -} - -static inline int is_normal(struct zone *zone) -{ - return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL; -} - -static inline int is_dma32(struct zone *zone) -{ -#ifdef CONFIG_ZONE_DMA32 - return zone == zone->zone_pgdat->node_zones + ZONE_DMA32; + return zone_is_idx(zone, ZONE_HIGHMEM) || + (zone_is_idx(zone, ZONE_MOVABLE) && zone_movable_is_highmem()); #else return 0; #endif } -static inline int is_dma(struct zone *zone) +static inline bool zone_is_movable(struct zone *zone) { -#ifdef CONFIG_ZONE_DMA - return zone == zone->zone_pgdat->node_zones + ZONE_DMA; -#else - return 0; -#endif + return zone_is_idx(zone, ZONE_MOVABLE); } /* These two functions are used to setup the per zone pages min values */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index b5d1384..2f541f1 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -237,7 +237,7 @@ PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */ * Must use a macro here due to header dependency issues. page_zone() is not * available at this point. */ -#define PageHighMem(__p) is_highmem(page_zone(__p)) +#define PageHighMem(__p) is_highmem_idx(page_zonenum(__p)) #else PAGEFLAG_FALSE(HighMem) #endif diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 0de2857..cbf0da9 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -830,7 +830,7 @@ static unsigned int count_free_highmem_pages(void) unsigned int cnt = 0; for_each_populated_zone(zone) - if (is_highmem(zone)) + if (zone_is_highmem(zone)) cnt += zone_page_state(zone, NR_FREE_PAGES); return cnt; @@ -879,7 +879,7 @@ static unsigned int count_highmem_pages(void) for_each_populated_zone(zone) { unsigned long pfn, max_zone_pfn; - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) continue; mark_free_pages(zone); @@ -943,7 +943,7 @@ static unsigned int count_data_pages(void) unsigned int n = 0; for_each_populated_zone(zone) { - if (is_highmem(zone)) + if (zone_is_highmem(zone)) continue; mark_free_pages(zone); @@ -989,7 +989,7 @@ static void safe_copy_page(void *dst, struct page *s_page) static inline struct page * page_is_saveable(struct zone *zone, unsigned long pfn) { - return is_highmem(zone) ? + return zone_is_highmem(zone) ? saveable_highmem_page(zone, pfn) : saveable_page(zone, pfn); } @@ -1338,7 +1338,7 @@ int hibernate_preallocate_memory(void) size = 0; for_each_populated_zone(zone) { size += snapshot_additional_pages(zone); - if (is_highmem(zone)) + if (zone_is_highmem(zone)) highmem += zone_page_state(zone, NR_FREE_PAGES); else count += zone_page_state(zone, NR_FREE_PAGES); @@ -1481,7 +1481,7 @@ static int enough_free_mem(unsigned int nr_pages, unsigned int nr_highmem) unsigned int free = alloc_normal; for_each_populated_zone(zone) - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) free += zone_page_state(zone, NR_FREE_PAGES); nr_pages += count_pages_for_highmem(nr_highmem); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7e208f0..9558341 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5136,7 +5136,7 @@ static void __setup_per_zone_wmarks(void) /* Calculate total number of !ZONE_HIGHMEM pages */ for_each_zone(zone) { - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) lowmem_pages += zone->present_pages; } @@ -5146,7 +5146,7 @@ static void __setup_per_zone_wmarks(void) spin_lock_irqsave(&zone->lock, flags); tmp = (u64)pages_min * zone->present_pages; do_div(tmp, lowmem_pages); - if (is_highmem(zone)) { + if (zone_is_highmem(zone)) { /* * __GFP_HIGH and PF_MEMALLOC allocations usually don't * need highmem pages, so cap pages_min to a small @@ -5585,7 +5585,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count) * For avoiding noise data, lru_add_drain_all() should be called * If ZONE_MOVABLE, the zone never contains unmovable pages */ - if (zone_idx(zone) == ZONE_MOVABLE) + if (zone_is_movable(zone)) return false; mt = get_pageblock_migratetype(page); if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt)) > _ > > > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -58,6 +58,8 @@ > > #include > > #include > > #include > > +#include > > +#include > > #include > > > > #include > > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > > } > > EXPORT_SYMBOL(get_user_pages); > > > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > +/** > > + * It's a wrapper of get_user_pages() but it makes sure that all pages come from > > + * non-movable zone via additional page migration. > > + */ > > This needs a description of why the function exists - say something > about the requirements of memory hotplug. > > Also a few words describing how the function works would be good. > > > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > > + unsigned long start, int nr_pages, int write, int force, > > + struct page **pages, struct vm_area_struct **vmas) > > +{ > > + int ret, i, isolate_err, migrate_pre_flag; migrate_pre_flag should be bool and the name is a bit unusual. migrate_prepped? > > + LIST_HEAD(pagelist); > > + > > +retry: > > + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, > > + vmas); > > We should handle (ret < 0) here. At present the function will silently > convert an error return into "return 0", which is bad. The function > does appear to otherwise do the right thing if get_user_pages() failed, > but only due to good luck. > A BUG_ON check if we retry more than once wouldn't hurt either. It requires a broken implementation of alloc_migrate_target() but someone might "fix" it and miss this. A minor aside, it would be nice if we exited at this point if there was no populated ZONE_MOVABLE in the system. There is a movable_zone global variable already. If it was forced to be MAX_NR_ZONES before the call to find_usable_zone_for_movable() in memory initialisation we should be able to make a cheap check for it here. > > + isolate_err = 0; > > + migrate_pre_flag = 0; > > + > > + for (i = 0; i < ret; i++) { > > + if (is_movable(page_zone(pages[i]))) { This is a relatively expensive lookup. If my patch above is used then you can add a is_movable_idx helper and then this becomes if (is_movable_idx(page_zonenum(pages[i]))) which is cheaper. > > + if (!migrate_pre_flag) { > > + if (migrate_prep()) > > + goto put_page; CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never return failure. You could make this BUG_ON(migrate_prep()), remove this goto and the migrate_pre_flag check below becomes redundant. > > + migrate_pre_flag = 1; > > + } > > + > > + if (!isolate_lru_page(pages[i])) { > > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > > + page_is_file_cache(pages[i])); > > + list_add_tail(&pages[i]->lru, &pagelist); > > + } else { > > + isolate_err = 1; > > + goto put_page; > > + } isolate_lru_page() takes the LRU lock every time. If get_user_pages_non_movable is used heavily then you may encounter lock contention problems. Batching this lock would be a separate patch and should not be necessary yet but you should at least comment on it as a reminder. I think that a goto could also have been avoided here if you used break. The i == ret check below would be false and it would just fall through. Overall the flow would be a bit easier to follow. Why list_add_tail()? I don't think it's wrong but it's unusual to see list_add_tail() when list_add() is enough. > > + } > > + } > > + > > + /* All pages are non movable, we are done :) */ > > + if (i == ret && list_empty(&pagelist)) > > + return ret; > > + > > +put_page: > > + /* Undo the effects of former get_user_pages(), we won't pin anything */ > > + for (i = 0; i < ret; i++) > > + put_page(pages[i]); > > + release_pages. That comment is insufficient. There are non-obvious consequences to this logic. We are dropping pins on all pages regardless of what zone they are in. If the subsequent migration fails then we end up returning 0 with no pages pinned. The user-visible effect is that io_setup() fails for non-obvious reasons. It will return EAGAIN to userspace which will be interpreted as "The specified nr_events exceeds the user's limit of available events.". The application will either fail or potentially infinite loop if the developer interpreted EAGAIN as "try again" as opposed to "this is a permanent failure". Is that deliberate? Is it really preferable that AIO can fail to setup and the application exit just in case we want to hot-remove memory later? Should a failed migration generate a WARN_ON at least? I would think that it's better to WARN_ON_ONCE if migration fails but pin the pages as requested. If a future memory hot-remove operation fails then the warning will indicate why but applications will not fail as a result. It's a little clumsy but the memory hot-remove failure message could list what applications have pinned the pages that cannot be removed so the administrator has the option of force-killing the application. It is possible to discover what application is pinning a page from userspace but it would involve an expensive search with /proc/kpagemap > > + if (migrate_pre_flag && !isolate_err) { > > + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > > + false, MIGRATE_SYNC, MR_SYSCALL); The conversion of alloc_migrate_target is a bit problematic. It strips the __GFP_MOVABLE flag and the consequence of this is that it converts those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large number of pages, particularly if the number of get_user_pages_non_movable() increases for short-lived pins like direct IO. One way around this is to add a high_zoneidx parameter to __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask) as the high_zoneidx and create a new migration allocation function that passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE > > + /* Steal pages from non-movable zone successfully? */ > > + if (!ret) > > + goto retry; > > This is buggy. migrate_pages() doesn't empty its `from' argument, so > page_list must be reinitialised here (or, better, at the start of the loop). > page_list should be empty if ret == 0. > Mel, what's up with migrate_pages()? Shouldn't it be removing the > pages from the list when MIGRATEPAGE_SUCCESS? The use of > list_for_each_entry_safe() suggests we used to do that... > On successful migration, the pages are put back on the LRU at the end of unmap_and_move(). On migration failure the caller is responsible for putting the pages back on the LRU with putback_lru_pages() which happens below. > > + } > > + > > + putback_lru_pages(&pagelist); > > + return 0; > > +} > > +EXPORT_SYMBOL(get_user_pages_non_movable); > > +#endif > > + > -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755355Ab3BENcw (ORCPT ); Tue, 5 Feb 2013 08:32:52 -0500 Received: from cantor2.suse.de ([195.135.220.15]:50056 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754141Ab3BENcv (ORCPT ); Tue, 5 Feb 2013 08:32:51 -0500 Date: Tue, 5 Feb 2013 13:32:44 +0000 From: Mel Gorman To: Andrew Morton Cc: Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130205133244.GH21389@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20130205115722.GF21389@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote: > > > > + migrate_pre_flag = 1; > > > + } > > > + > > > + if (!isolate_lru_page(pages[i])) { > > > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > > > + page_is_file_cache(pages[i])); > > > + list_add_tail(&pages[i]->lru, &pagelist); > > > + } else { > > > + isolate_err = 1; > > > + goto put_page; > > > + } > > isolate_lru_page() takes the LRU lock every time. Credit to Michal Hocko for bringing this up but with the number of other issues I missed that this is also broken with respect to huge page handling. hugetlbfs pages will not be on the LRU so the isolation will mess up and the migration has to be handled differently. Ordinarily hugetlbfs pages cannot be allocated from ZONE_MOVABLE but it is possible to configure it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this encounters a hugetlbfs page, it'll just blow up. The other is that this almost certainly broken for transhuge page handling. gup returns the head and tail pages and ordinarily this is ok because the caller only cares about the physical address. Migration will also split a hugepage if it receives it but you are potentially adding tail pages to a list here and then migrating them. The split of the first page will get very confused. I'm not exactly sure what the result will be but it won't be pretty. Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled during testing? -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756765Ab3BEVNJ (ORCPT ); Tue, 5 Feb 2013 16:13:09 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:41748 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754228Ab3BEVNG (ORCPT ); Tue, 5 Feb 2013 16:13:06 -0500 Date: Tue, 5 Feb 2013 13:13:04 -0800 From: Andrew Morton To: Lin Feng Cc: mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Tang chen , Gu Zheng Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-Id: <20130205131304.82a5c4cb.akpm@linux-foundation.org> In-Reply-To: <511077E7.4040605@cn.fujitsu.com> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <511077E7.4040605@cn.fujitsu.com> X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 05 Feb 2013 11:09:27 +0800 Lin Feng wrote: > > > >> struct kvec; > >> int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > >> struct page **pages); > >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > >> index 73b64a3..5db811e 100644 > >> --- a/include/linux/mmzone.h > >> +++ b/include/linux/mmzone.h > >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > >> return (idx == ZONE_NORMAL); > >> } > >> > >> +static inline int is_movable(struct zone *zone) > >> +{ > >> + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > >> +} > > > > A better name would be zone_is_movable(). We haven't been very > > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > > > Yes, zone_is_xxx() should be a better name, but there are some analogous > definition like is_dma32() and is_normal() etc, if we only use zone_is_movable() > for movable zone it will break such naming rules. > Should we update other ones in a separate patch later or just keep the old style? I do think the old names were poorly chosen. Yes, we could fix them up sometime but it's hardly a pressing issue. > > And a neater implementation would be > > > > return zone_idx(zone) == ZONE_MOVABLE; > > > OK. After your change, should we also update for other ones such as is_normal()..? Sure. From: Andrew Morton Subject: include-linux-mmzoneh-cleanups-fix use zone_idx() some more, further simplify is_highmem() Cc: Lin Feng Cc: Mel Gorman Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix include/linux/mmzone.h --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix +++ a/include/linux/mmzone.h @@ -859,25 +859,18 @@ static inline int is_normal_idx(enum zon */ static inline int is_highmem(struct zone *zone) { -#ifdef CONFIG_HIGHMEM - enum zone_type idx = zone_idx(zone); - - return idx == ZONE_HIGHMEM || - (idx == ZONE_MOVABLE && zone_movable_is_highmem()); -#else - return 0; -#endif + return is_highmem_idx(zone_idx(zone)); } static inline int is_normal(struct zone *zone) { - return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL; + return zone_idx(zone) == ZONE_NORMAL; } static inline int is_dma32(struct zone *zone) { #ifdef CONFIG_ZONE_DMA32 - return zone == zone->zone_pgdat->node_zones + ZONE_DMA32; + return zone_idx(zone) == ZONE_DMA32; #else return 0; #endif @@ -886,7 +879,7 @@ static inline int is_dma32(struct zone * static inline int is_dma(struct zone *zone) { #ifdef CONFIG_ZONE_DMA - return zone == zone->zone_pgdat->node_zones + ZONE_DMA; + return zone_idx(zone) == ZONE_DMA; #else return 0; #endif _ > >> + /* All pages are non movable, we are done :) */ > >> + if (i == ret && list_empty(&pagelist)) > >> + return ret; > >> + > >> +put_page: > >> + /* Undo the effects of former get_user_pages(), we won't pin anything */ > >> + for (i = 0; i < ret; i++) > >> + put_page(pages[i]); > >> + > >> + if (migrate_pre_flag && !isolate_err) { > >> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >> + false, MIGRATE_SYNC, MR_SYSCALL); > >> + /* Steal pages from non-movable zone successfully? */ > >> + if (!ret) > >> + goto retry; > > > > This is buggy. migrate_pages() doesn't empty its `from' argument, so > > page_list must be reinitialised here (or, better, at the start of the loop). > migrate_pages() > list_for_each_entry_safe() > unmap_and_move() > if (rc != -EAGAIN) { > list_del(&page->lru); > } ah, OK, there it is. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757159Ab3BFC0z (ORCPT ); Tue, 5 Feb 2013 21:26:55 -0500 Received: from mail-ve0-f171.google.com ([209.85.128.171]:53660 "EHLO mail-ve0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757112Ab3BFC0w (ORCPT ); Tue, 5 Feb 2013 21:26:52 -0500 MIME-Version: 1.0 In-Reply-To: <20130205115722.GF21389@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> Date: Tue, 5 Feb 2013 18:26:51 -0800 Message-ID: Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() From: Michel Lespinasse To: Mel Gorman Cc: Andrew Morton , Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Just nitpicking, but: On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman wrote: > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) > +{ > + /* This mess avoids a potentially expensive pointer subtraction. */ > + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > + return zone_off == idx * sizeof(*zone); > +} Maybe: return zone == zone->zone_pgdat->node_zones + idx; ? -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757121Ab3BFKl0 (ORCPT ); Wed, 6 Feb 2013 05:41:26 -0500 Received: from cantor2.suse.de ([195.135.220.15]:33204 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751947Ab3BFKlX (ORCPT ); Wed, 6 Feb 2013 05:41:23 -0500 Date: Wed, 6 Feb 2013 10:41:17 +0000 From: Mel Gorman To: Michel Lespinasse Cc: Andrew Morton , Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130206104117.GO21389@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 05, 2013 at 06:26:51PM -0800, Michel Lespinasse wrote: > Just nitpicking, but: > > On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman wrote: > > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) > > +{ > > + /* This mess avoids a potentially expensive pointer subtraction. */ > > + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > > + return zone_off == idx * sizeof(*zone); > > +} > > Maybe: > return zone == zone->zone_pgdat->node_zones + idx; > ? > Not a nit at all. Yours is more readable but it generates more code. A single line function that uses the helper generates 0x3f bytes of code (mostly function entry/exit) with your version and 0x39 bytes with mine. The difference in efficiency is marginal as your version uses lea to multiply by a constant but it's still slightly heavier. The old code is fractionally better, your version is more readable so it's up to Andrew really. Right now I think he's gone with his own version with zone_idx() in the name of readability whatever sparse has to say about the matter. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752125Ab3BRKgf (ORCPT ); Mon, 18 Feb 2013 05:36:35 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:32642 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1750880Ab3BRKgd (ORCPT ); Mon, 18 Feb 2013 05:36:33 -0500 X-IronPort-AV: E=Sophos;i="4.84,686,1355068800"; d="scan'208";a="6724141" Message-ID: <512203C4.8010608@cn.fujitsu.com> Date: Mon, 18 Feb 2013 18:34:44 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Mel Gorman CC: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> In-Reply-To: <20130205115722.GF21389@suse.de> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/18 18:35:39, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/18 18:35:40, Serialize complete at 2013/02/18 18:35:40 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Mel, See below. On 02/05/2013 07:57 PM, Mel Gorman wrote: > On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote: >> The ifdefs aren't really needed here and I encourage people to omit >> them. This keeps the header files looking neater and reduces the >> chances of things later breaking because we forgot to update some >> CONFIG_foo logic in a header file. The downside is that errors will be >> revealed at link time rather than at compile time, but that's a pretty >> small cost. >> > > As an aside, if ifdefs *have* to be used then it often better to have a > pattern like > > #ifdef CONFIG_MEMORY_HOTREMOVE > int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > unsigned long start, int nr_pages, int write, int force, > struct page **pages, struct vm_area_struct **vmas); > #else > static inline get_user_pages_non_movable(...) > { > get_user_pages(...) > } > #endif > > It eliminates the need for #ifdefs in the C file that calls > get_user_pages_non_movable(). Thanks for pointing out :) >>> + >>> +retry: >>> + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, >>> + vmas); >> >> We should handle (ret < 0) here. At present the function will silently >> convert an error return into "return 0", which is bad. The function >> does appear to otherwise do the right thing if get_user_pages() failed, >> but only due to good luck. >> > > A BUG_ON check if we retry more than once wouldn't hurt either. It requires > a broken implementation of alloc_migrate_target() but someone might "fix" > it and miss this. Agree. > > A minor aside, it would be nice if we exited at this point if there was > no populated ZONE_MOVABLE in the system. There is a movable_zone global > variable already. If it was forced to be MAX_NR_ZONES before the call to > find_usable_zone_for_movable() in memory initialisation we should be able > to make a cheap check for it here. OK. >>> + if (!migrate_pre_flag) { >>> + if (migrate_prep()) >>> + goto put_page; > > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never > return failure. You could make this BUG_ON(migrate_prep()), remove this > goto and the migrate_pre_flag check below becomes redundant. Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check will call/check migrate_prep() every time we isolate a single page, do we have to do so? I mean that for a group of pages it's sufficient to call migrate_prep() only once. There are some other migrate-relative codes using this scheme. So I get confused, do I miss something or the the performance cost is little? :( > >>> + migrate_pre_flag = 1; >>> + } >>> + >>> + if (!isolate_lru_page(pages[i])) { >>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >>> + page_is_file_cache(pages[i])); >>> + list_add_tail(&pages[i]->lru, &pagelist); >>> + } else { >>> + isolate_err = 1; >>> + goto put_page; >>> + } > > isolate_lru_page() takes the LRU lock every time. If > get_user_pages_non_movable is used heavily then you may encounter lock > contention problems. Batching this lock would be a separate patch and should > not be necessary yet but you should at least comment on it as a reminder. Farsighted, should improve it in the future. > > I think that a goto could also have been avoided here if you used break. The > i == ret check below would be false and it would just fall through. > Overall the flow would be a bit easier to follow. Yes, I noticed this before. I thought using goto could save some micro ops (here is only the i == ret check) though lower the readability but still be clearly. Also there are some other place in current kernel facing such performance/readability race condition, but it seems that the developers prefer readability, why? While I think performance is fatal to kernel.. > > Why list_add_tail()? I don't think it's wrong but it's unusual to see > list_add_tail() when list_add() is enough. Sorry, not intentional, just steal from other codes without thinking more ;-) > >>> + } >>> + } >>> + >>> + /* All pages are non movable, we are done :) */ >>> + if (i == ret && list_empty(&pagelist)) >>> + return ret; >>> + >>> +put_page: >>> + /* Undo the effects of former get_user_pages(), we won't pin anything */ >>> + for (i = 0; i < ret; i++) >>> + put_page(pages[i]); >>> + > > release_pages. > > That comment is insufficient. There are non-obvious consequences to this > logic. We are dropping pins on all pages regardless of what zone they > are in. If the subsequent migration fails then we end up returning 0 > with no pages pinned. The user-visible effect is that io_setup() fails > for non-obvious reasons. It will return EAGAIN to userspace which will be > interpreted as "The specified nr_events exceeds the user's limit of available > events.". The application will either fail or potentially infinite loop > if the developer interpreted EAGAIN as "try again" as opposed to "this is > a permanent failure". Yes, I missed such things. > > Is that deliberate? Is it really preferable that AIO can fail to setup > and the application exit just in case we want to hot-remove memory later? > Should a failed migration generate a WARN_ON at least? > > I would think that it's better to WARN_ON_ONCE if migration fails but pin > the pages as requested. If a future memory hot-remove operation fails > then the warning will indicate why but applications will not fail as a I'm so agree, I will keep the pin state alive and issue a WARN_ON_ONCE message in such case. > result. It's a little clumsy but the memory hot-remove failure message > could list what applications have pinned the pages that cannot be removed > so the administrator has the option of force-killing the application. It > is possible to discover what application is pinning a page from userspace > but it would involve an expensive search with /proc/kpagemap > >>> + if (migrate_pre_flag && !isolate_err) { >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >>> + false, MIGRATE_SYNC, MR_SYSCALL); > > The conversion of alloc_migrate_target is a bit problematic. It strips > the __GFP_MOVABLE flag and the consequence of this is that it converts > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > number of pages, particularly if the number of get_user_pages_non_movable() > increases for short-lived pins like direct IO. Sorry, I don't quite understand here neither. If we use the following new migration allocation function as you said, the increasing number of get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE pages. What's the difference, do I miss something? > > One way around this is to add a high_zoneidx parameter to > __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new > inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask) > as the high_zoneidx and create a new migration allocation function that > passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to > happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE On 02/05/2013 09:32 PM, Mel Gorman wrote: > Credit to Michal Hocko for bringing this up but with the number of > other issues I missed that this is also broken with respect to huge page > handling. hugetlbfs pages will not be on the LRU so the isolation will mess > up and the migration has to be handled differently. Ordinarily hugetlbfs > pages cannot be allocated from ZONE_MOVABLE but it is possible to configure > it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this > encounters a hugetlbfs page, it'll just blow up. > > The other is that this almost certainly broken for transhuge page > handling. gup returns the head and tail pages and ordinarily this is ok > because the caller only cares about the physical address. Migration will > also split a hugepage if it receives it but you are potentially adding > tail pages to a list here and then migrating them. The split of the first > page will get very confused. I'm not exactly sure what the result will be > but it won't be pretty. > > Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled > during testing? Sorry, I hadn't considered THP and hugepage yet, I will deal such cases based on your comments later. thanks a lot for your review :) linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754057Ab3BRPRY (ORCPT ); Mon, 18 Feb 2013 10:17:24 -0500 Received: from cantor2.suse.de ([195.135.220.15]:34101 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753648Ab3BRPRX (ORCPT ); Mon, 18 Feb 2013 10:17:23 -0500 Date: Mon, 18 Feb 2013 15:17:16 +0000 From: Mel Gorman To: Lin Feng Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130218151716.GL4365@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <512203C4.8010608@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <512203C4.8010608@cn.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 18, 2013 at 06:34:44PM +0800, Lin Feng wrote: > >>> + if (!migrate_pre_flag) { > >>> + if (migrate_prep()) > >>> + goto put_page; > > > > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never > > return failure. You could make this BUG_ON(migrate_prep()), remove this > > goto and the migrate_pre_flag check below becomes redundant. > Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check > will call/check migrate_prep() every time we isolate a single page, do we have > to do so? I was referring to the migrate_pre_flag check further down and I'm not suggesting it be removed from here as you do want to call migrate_prep() only once. However, as it'll never return failure for any kernel configuration that allows memory hot-remove, this goto can be removed and the flow simplified. > > > >>> + migrate_pre_flag = 1; > >>> + } > >>> + > >>> + if (!isolate_lru_page(pages[i])) { > >>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > >>> + page_is_file_cache(pages[i])); > >>> + list_add_tail(&pages[i]->lru, &pagelist); > >>> + } else { > >>> + isolate_err = 1; > >>> + goto put_page; > >>> + } > > > > isolate_lru_page() takes the LRU lock every time. If > > get_user_pages_non_movable is used heavily then you may encounter lock > > contention problems. Batching this lock would be a separate patch and should > > not be necessary yet but you should at least comment on it as a reminder. > Farsighted, should improve it in the future. > > > > > I think that a goto could also have been avoided here if you used break. The > > i == ret check below would be false and it would just fall through. > > Overall the flow would be a bit easier to follow. > Yes, I noticed this before. I thought using goto could save some micro ops > (here is only the i == ret check) though lower the readability but still be clearly. > > Also there are some other place in current kernel facing such performance/readability > race condition, but it seems that the developers prefer readability, why? While I > think performance is fatal to kernel.. > Memory hot-remove and page migration are not performance critical paths. For page migration, the cost will be likely dominated by the page copy but it's also possible the cost will be dominated by locking. The difference between a break and goto will not even be measurable. > > > > > > result. It's a little clumsy but the memory hot-remove failure message > > could list what applications have pinned the pages that cannot be removed > > so the administrator has the option of force-killing the application. It > > is possible to discover what application is pinning a page from userspace > > but it would involve an expensive search with /proc/kpagemap > > > >>> + if (migrate_pre_flag && !isolate_err) { > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >>> + false, MIGRATE_SYNC, MR_SYSCALL); > > > > The conversion of alloc_migrate_target is a bit problematic. It strips > > the __GFP_MOVABLE flag and the consequence of this is that it converts > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > > number of pages, particularly if the number of get_user_pages_non_movable() > > increases for short-lived pins like direct IO. > > Sorry, I don't quite understand here neither. If we use the following new > migration allocation function as you said, the increasing number of > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE > pages. What's the difference, do I miss something? > The replacement function preserves the __GFP_MOVABLE flag. It cannot use ZONE_MOVABLE but otherwise the newly allocated page will be grouped with other movable pages. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757618Ab3BSJ5K (ORCPT ); Tue, 19 Feb 2013 04:57:10 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:10389 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1755812Ab3BSJ5H (ORCPT ); Tue, 19 Feb 2013 04:57:07 -0500 X-IronPort-AV: E=Sophos;i="4.84,694,1355068800"; d="scan'208";a="6729673" Message-ID: <51234C12.4020404@cn.fujitsu.com> Date: Tue, 19 Feb 2013 17:55:30 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Mel Gorman CC: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <512203C4.8010608@cn.fujitsu.com> <20130218151716.GL4365@suse.de> In-Reply-To: <20130218151716.GL4365@suse.de> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/19 17:56:24, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/19 17:56:26, Serialize complete at 2013/02/19 17:56:26 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Mel, On 02/18/2013 11:17 PM, Mel Gorman wrote: >>> > > >>> > > >>> > > result. It's a little clumsy but the memory hot-remove failure message >>> > > could list what applications have pinned the pages that cannot be removed >>> > > so the administrator has the option of force-killing the application. It >>> > > is possible to discover what application is pinning a page from userspace >>> > > but it would involve an expensive search with /proc/kpagemap >>> > > >>>>> > >>> + if (migrate_pre_flag && !isolate_err) { >>>>> > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >>>>> > >>> + false, MIGRATE_SYNC, MR_SYSCALL); >>> > > >>> > > The conversion of alloc_migrate_target is a bit problematic. It strips >>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts >>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large >>> > > number of pages, particularly if the number of get_user_pages_non_movable() >>> > > increases for short-lived pins like direct IO. >> > >> > Sorry, I don't quite understand here neither. If we use the following new >> > migration allocation function as you said, the increasing number of >> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE >> > pages. What's the difference, do I miss something? >> > > The replacement function preserves the __GFP_MOVABLE flag. It cannot use > ZONE_MOVABLE but otherwise the newly allocated page will be grouped with > other movable pages. Ah, got it " But GFP_MOVABLE is not only a zone specifier but also an allocation policy.". Could I clear __GFP_HIGHMEM flag in alloc_migrate_target depending on private parameter so that we can keep MIGRATE_UNMOVABLE policy also allocate page none movable zones with little change? Does that approach work? Otherwise I have to follow your suggestion. thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932694Ab3BSKfJ (ORCPT ); Tue, 19 Feb 2013 05:35:09 -0500 Received: from cantor2.suse.de ([195.135.220.15]:45646 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932515Ab3BSKfE (ORCPT ); Tue, 19 Feb 2013 05:35:04 -0500 Date: Tue, 19 Feb 2013 10:34:59 +0000 From: Mel Gorman To: Lin Feng Cc: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130219103458.GN4365@suse.de> References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <512203C4.8010608@cn.fujitsu.com> <20130218151716.GL4365@suse.de> <51234C12.4020404@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <51234C12.4020404@cn.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 19, 2013 at 05:55:30PM +0800, Lin Feng wrote: > Hi Mel, > > On 02/18/2013 11:17 PM, Mel Gorman wrote: > >>> > > > >>> > > > >>> > > result. It's a little clumsy but the memory hot-remove failure message > >>> > > could list what applications have pinned the pages that cannot be removed > >>> > > so the administrator has the option of force-killing the application. It > >>> > > is possible to discover what application is pinning a page from userspace > >>> > > but it would involve an expensive search with /proc/kpagemap > >>> > > > >>>>> > >>> + if (migrate_pre_flag && !isolate_err) { > >>>>> > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >>>>> > >>> + false, MIGRATE_SYNC, MR_SYSCALL); > >>> > > > >>> > > The conversion of alloc_migrate_target is a bit problematic. It strips > >>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts > >>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > >>> > > number of pages, particularly if the number of get_user_pages_non_movable() > >>> > > increases for short-lived pins like direct IO. > >> > > >> > Sorry, I don't quite understand here neither. If we use the following new > >> > migration allocation function as you said, the increasing number of > >> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE > >> > pages. What's the difference, do I miss something? > >> > > > The replacement function preserves the __GFP_MOVABLE flag. It cannot use > > ZONE_MOVABLE but otherwise the newly allocated page will be grouped with > > other movable pages. > > Ah, got it " But GFP_MOVABLE is not only a zone specifier but also an allocation policy.". > Update the comment and describe the exception then. > Could I clear __GFP_HIGHMEM flag in alloc_migrate_target depending on private parameter so > that we can keep MIGRATE_UNMOVABLE policy also allocate page none movable zones with little > change? > It should work (double check gfp_zone) but then the allocation cannot use the highmem zone. If you can be 100% certain that zone will not exist be populated then it will work as expected but it's a hack and should be commented clearly. You could do a BUILD_BUG_ON if CONFIG_HIGHMEM is set to enforce it. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932855Ab3BSNjf (ORCPT ); Tue, 19 Feb 2013 08:39:35 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:47182 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S932760Ab3BSNjc (ORCPT ); Tue, 19 Feb 2013 08:39:32 -0500 X-IronPort-AV: E=Sophos;i="4.84,695,1355068800"; d="scan'208";a="6730291" Message-ID: <51238033.6010005@cn.fujitsu.com> Date: Tue, 19 Feb 2013 21:37:55 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Mel Gorman CC: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> In-Reply-To: <20130205133244.GH21389@suse.de> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/19 21:38:49, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/19 21:38:50, Serialize complete at 2013/02/19 21:38:50 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Mel, On 02/05/2013 09:32 PM, Mel Gorman wrote: > On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote: >> >>>> + migrate_pre_flag = 1; >>>> + } >>>> + >>>> + if (!isolate_lru_page(pages[i])) { >>>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >>>> + page_is_file_cache(pages[i])); >>>> + list_add_tail(&pages[i]->lru, &pagelist); >>>> + } else { >>>> + isolate_err = 1; >>>> + goto put_page; >>>> + } >> >> isolate_lru_page() takes the LRU lock every time. > > Credit to Michal Hocko for bringing this up but with the number of > other issues I missed that this is also broken with respect to huge page > handling. hugetlbfs pages will not be on the LRU so the isolation will mess > up and the migration has to be handled differently. Ordinarily hugetlbfs > pages cannot be allocated from ZONE_MOVABLE but it is possible to configure > it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this > encounters a hugetlbfs page, it'll just blow up. I look into the migrate_huge_page() codes find that if we support the hugetlbfs non movable migration, we have to invent another alloc_huge_page_node_nonmovable() or such allocate interface, which cost is large(exploding the codes and great impact on current alloc_huge_page_node()) but gains little, I think that pinning hugepage is a corner case. So can we skip over hugepage without migration but give some WARN_ON() info, is it acceptable? > > The other is that this almost certainly broken for transhuge page > handling. gup returns the head and tail pages and ordinarily this is ok I can't find codes doing such things :(, could you please point me out? > because the caller only cares about the physical address. Migration will > also split a hugepage if it receives it but you are potentially adding > tail pages to a list here and then migrating them. The split of the first > page will get very confused. I'm not exactly sure what the result will be > but it won't be pretty. > > Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled > during testing? I checked my config file that both CONFIG options aboved are enabled. However it was only be tested by two services invoking io_setup(), it works fine.. thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935451Ab3BTCgP (ORCPT ); Tue, 19 Feb 2013 21:36:15 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:44064 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S934420Ab3BTCgO (ORCPT ); Tue, 19 Feb 2013 21:36:14 -0500 X-IronPort-AV: E=Sophos;i="4.84,699,1355068800"; d="scan'208";a="6732269" Message-ID: <5124363C.9060604@cn.fujitsu.com> Date: Wed, 20 Feb 2013 10:34:36 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Mel Gorman CC: Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51238033.6010005@cn.fujitsu.com> In-Reply-To: <51238033.6010005@cn.fujitsu.com> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/20 10:35:31, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/20 10:35:32, Serialize complete at 2013/02/20 10:35:32 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/19/2013 09:37 PM, Lin Feng wrote: >> > >> > The other is that this almost certainly broken for transhuge page >> > handling. gup returns the head and tail pages and ordinarily this is ok > I can't find codes doing such things :(, could you please point me out? > Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it. flee... thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934774Ab3BTJ6h (ORCPT ); Wed, 20 Feb 2013 04:58:37 -0500 Received: from mail-yh0-f45.google.com ([209.85.213.45]:47527 "EHLO mail-yh0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756797Ab3BTJ6e (ORCPT ); Wed, 20 Feb 2013 04:58:34 -0500 X-Greylist: delayed 83990 seconds by postgrey-1.27 at vger.kernel.org; Wed, 20 Feb 2013 04:58:34 EST Message-ID: <51249E3E.9070909@gmail.com> Date: Wed, 20 Feb 2013 17:58:22 +0800 From: Simon Jeons User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130106 Thunderbird/17.0.2 MIME-Version: 1.0 To: Mel Gorman CC: Andrew Morton , Lin Feng , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> In-Reply-To: <20130205133244.GH21389@suse.de> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/05/2013 09:32 PM, Mel Gorman wrote: > On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote: >>>> + migrate_pre_flag = 1; >>>> + } >>>> + >>>> + if (!isolate_lru_page(pages[i])) { >>>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >>>> + page_is_file_cache(pages[i])); >>>> + list_add_tail(&pages[i]->lru, &pagelist); >>>> + } else { >>>> + isolate_err = 1; >>>> + goto put_page; >>>> + } >> isolate_lru_page() takes the LRU lock every time. > Credit to Michal Hocko for bringing this up but with the number of > other issues I missed that this is also broken with respect to huge page > handling. hugetlbfs pages will not be on the LRU so the isolation will mess > up and the migration has to be handled differently. Ordinarily hugetlbfs > pages cannot be allocated from ZONE_MOVABLE but it is possible to configure > it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this > encounters a hugetlbfs page, it'll just blow up. As you said, hugetlbfs pages are not in LRU list, then how can encounter a hugetlbfs page and blow up? > > The other is that this almost certainly broken for transhuge page > handling. gup returns the head and tail pages and ordinarily this is ok When need gup thp? in kvm case? > because the caller only cares about the physical address. Migration will > also split a hugepage if it receives it but you are potentially adding > tail pages to a list here and then migrating them. The split of the first > page will get very confused. I'm not exactly sure what the result will be > but it won't be pretty. > > Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled > during testing? > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935195Ab3BTLLx (ORCPT ); Wed, 20 Feb 2013 06:11:53 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:2159 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S935176Ab3BTLLu (ORCPT ); Wed, 20 Feb 2013 06:11:50 -0500 X-IronPort-AV: E=Sophos;i="4.84,701,1355068800"; d="scan'208";a="6736142" Message-ID: <5124A42B.1020905@cn.fujitsu.com> Date: Wed, 20 Feb 2013 18:23:39 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Simon Jeons CC: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51249E3E.9070909@gmail.com> In-Reply-To: <51249E3E.9070909@gmail.com> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/20 18:24:34, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/20 18:25:42, Serialize complete at 2013/02/20 18:25:42 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Simon, On 02/20/2013 05:58 PM, Simon Jeons wrote: > >> >> The other is that this almost certainly broken for transhuge page >> handling. gup returns the head and tail pages and ordinarily this is ok > > When need gup thp? in kvm case? gup just pins the wanted pages(for x86 is 4k size) of user address space in memory. We can't expect the pages have been allocated for user address space are thp or normal page. So we need to deal with them and I think it have nothing to do with kvm. thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759092Ab3BTLcH (ORCPT ); Wed, 20 Feb 2013 06:32:07 -0500 Received: from mail-yh0-f44.google.com ([209.85.213.44]:42640 "EHLO mail-yh0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758727Ab3BTLcF (ORCPT ); Wed, 20 Feb 2013 06:32:05 -0500 Message-ID: <5124B42A.1020908@gmail.com> Date: Wed, 20 Feb 2013 19:31:54 +0800 From: Simon Jeons User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130106 Thunderbird/17.0.2 MIME-Version: 1.0 To: Lin Feng CC: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51249E3E.9070909@gmail.com> <5124A42B.1020905@cn.fujitsu.com> In-Reply-To: <5124A42B.1020905@cn.fujitsu.com> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/20/2013 06:23 PM, Lin Feng wrote: > Hi Simon, > > On 02/20/2013 05:58 PM, Simon Jeons wrote: >>> The other is that this almost certainly broken for transhuge page >>> handling. gup returns the head and tail pages and ordinarily this is ok >> When need gup thp? in kvm case? > gup just pins the wanted pages(for x86 is 4k size) of user address space in memory. > We can't expect the pages have been allocated for user address space are thp or > normal page. So we need to deal with them and I think it have nothing to do with kvm. Ok, I'm curious about userspace process call which funtion(will call gup) to pin pages except make_pages_present? > > thanks, > linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935230Ab3BTM0v (ORCPT ); Wed, 20 Feb 2013 07:26:51 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:54146 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1756355Ab3BTM0t (ORCPT ); Wed, 20 Feb 2013 07:26:49 -0500 X-IronPort-AV: E=Sophos;i="4.84,701,1355068800"; d="scan'208";a="6736858" Message-ID: <5124B991.1020302@cn.fujitsu.com> Date: Wed, 20 Feb 2013 19:54:57 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Simon Jeons CC: Mel Gorman , Andrew Morton , bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, mhocko@suse.cz, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com> <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com> <20130204160624.5c20a8a0.akpm@linux-foundation.org> <20130205115722.GF21389@suse.de> <20130205133244.GH21389@suse.de> <51249E3E.9070909@gmail.com> <5124A42B.1020905@cn.fujitsu.com> <5124B42A.1020908@gmail.com> In-Reply-To: <5124B42A.1020908@gmail.com> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/20 19:55:51, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/20 19:55:59, Serialize complete at 2013/02/20 19:55:59 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/20/2013 07:31 PM, Simon Jeons wrote: > On 02/20/2013 06:23 PM, Lin Feng wrote: >> Hi Simon, >> >> On 02/20/2013 05:58 PM, Simon Jeons wrote: >>>> The other is that this almost certainly broken for transhuge page >>>> handling. gup returns the head and tail pages and ordinarily this is ok >>> When need gup thp? in kvm case? >> gup just pins the wanted pages(for x86 is 4k size) of user address space in memory. >> We can't expect the pages have been allocated for user address space are thp or >> normal page. So we need to deal with them and I think it have nothing to do with kvm. > > Ok, I'm curious about userspace process call which funtion(will call gup) to pin pages except make_pages_present? No, userspace process doesn't pin any pages directly but through some syscalls like io_setup() indirectly for other purpose because kernel can't pagefault and it have to keep the page alive. Kernel wants to communicate with the userspace such as to notify some events so it need some sort of buffer that both Kernel and User space can both access, which leads to so called pin pages by gup. thanks, linfeng