From mboxrd@z Thu Jan 1 00:00:00 1970
From: Lin Feng
Subject: [PATCH V2 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
Date: Tue, 5 Feb 2013 17:21:51 +0800
Message-ID: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com>
Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng
To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk
Return-path:
Sender: owner-linux-aio@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Currently get_user_pages() always tries to allocate pages from the movable
zone. As discussed in the thread https://lkml.org/lkml/2012/11/29/69, some
users of get_user_pages() can easily pin user pages for a long time (so far
we have found that pages pinned as aio ring pages are such a case), which is
fatal for the memory hotplug/remove framework.

So the 1st patch introduces a new library function called
get_user_pages_non_movable() that pins pages only from non-movable zones.
It is a wrapper around get_user_pages() that makes sure all pages come from
a non-movable zone, using additional page migration where necessary.

The 2nd patch works around the bug that aio ring pages can't be migrated,
which is caused by get_user_pages(), by using the new function.
It only takes effect when the kernel is configured with
CONFIG_MEMORY_HOTREMOVE; otherwise it falls back to the old
get_user_pages().

---
ChangeLog v1->v2:
 Patch1:
 - Fix the negative return value bug pointed out by Andrew, and apply the
   other suggestions from Andrew and Jeff.
 Patch2:
 - Kill the CONFIG_MEMORY_HOTREMOVE dependence, as suggested by Jeff.
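The pin, check, migrate and retry flow described in the cover letter above
can be illustrated with a small userspace model. This is only a sketch under
stated assumptions, not the kernel implementation: every sim_* name is
hypothetical, a "page" is reduced to a zone tag and a pin counter, and
migration is modelled as simply retagging the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types or APIs. */
enum sim_zone { SIM_ZONE_NORMAL, SIM_ZONE_MOVABLE };

struct sim_page {
	enum sim_zone zone;
	int pin_count;
	bool migratable;	/* whether the modelled migration can succeed */
};

void sim_pin_all(struct sim_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pages[i].pin_count++;
}

void sim_unpin_all(struct sim_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pages[i].pin_count--;
}

/*
 * Pin all pages, but only keep the pins if none of them sits in the
 * movable zone. Otherwise drop the pins, migrate the movable pages and
 * retry once. Returns the number of pages pinned, or 0 if a movable
 * page could not be migrated.
 */
size_t sim_pin_non_movable(struct sim_page *pages, size_t n)
{
	assert(pages != NULL);

	for (int attempt = 0; attempt < 2; attempt++) {
		bool any_movable = false;
		bool migrated_all = true;

		sim_pin_all(pages, n);
		for (size_t i = 0; i < n; i++)
			if (pages[i].zone == SIM_ZONE_MOVABLE)
				any_movable = true;
		if (!any_movable)
			return n;	/* every pin is on a non-movable page */

		/* Undo the pins before migrating; nothing stays pinned. */
		sim_unpin_all(pages, n);
		for (size_t i = 0; i < n; i++) {
			if (pages[i].zone != SIM_ZONE_MOVABLE)
				continue;
			if (!pages[i].migratable) {
				migrated_all = false;
				continue;
			}
			pages[i].zone = SIM_ZONE_NORMAL;
		}
		if (!migrated_all)
			return 0;
	}
	return 0;
}
```

The model keeps the property the patch relies on: when migration fails, the
caller ends up with no pins at all rather than a partial set.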
--- Lin Feng (2): mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove fs/aio.c | 4 +- include/linux/mm.h | 3 ++ include/linux/mmzone.h | 4 ++ mm/memory.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++ mm/page_isolation.c | 5 +++ 5 files changed, 97 insertions(+), 2 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: [PATCH V2 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Date: Tue, 5 Feb 2013 17:21:53 +0800 Message-ID: <1360056113-14294-3-git-send-email-linfeng@cn.fujitsu.com> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Return-path: In-Reply-To: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org This patch gets around the aio ring pages can't be migrated bug caused by get_user_pages() via using the new function. It only works as configed with CONFIG_MEMORY_HOTREMOVE, otherwise it falls back to use the old version of get_user_pages(). 
Cc: Benjamin LaHaise Cc: Alexander Viro Cc: Andrew Morton Cc: Jeff Moyer Cc: Minchan Kim Cc: Zach Brown Reviewed-by: Tang Chen Reviewed-by: Gu Zheng Signed-off-by: Lin Feng --- fs/aio.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 71f613c..f7a0d5c 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -138,8 +138,8 @@ static int aio_setup_ring(struct kioctx *ctx) } dprintk("mmap address: 0x%08lx\n", info->mmap_base); - info->nr_pages = get_user_pages(current, ctx->mm, - info->mmap_base, nr_pages, + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, + info->mmap_base, nr_pages, 1, 0, info->ring_pages, NULL); up_write(&ctx->mm->mmap_sem); -- 1.7.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 5 Feb 2013 17:21:52 +0800 Message-ID: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Return-path: In-Reply-To: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org get_user_pages() always tries to allocate pages from movable zone, which is not reliable to memory hotremove framework in some 
case. This patch introduces a new library function called get_user_pages_non_movable() to pin pages only from zone non-movable in memory. It's a wrapper of get_user_pages() but it makes sure that all pages come from non-movable zone via additional page migration. Cc: Andrew Morton Cc: Mel Gorman Cc: KAMEZAWA Hiroyuki Cc: Yasuaki Ishimatsu Cc: Jeff Moyer Cc: Minchan Kim Cc: Zach Brown Reviewed-by: Tang Chen Reviewed-by: Gu Zheng Signed-off-by: Lin Feng --- include/linux/mm.h | 3 ++ include/linux/mmzone.h | 4 ++ mm/memory.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++ mm/page_isolation.c | 5 +++ 4 files changed, 95 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 12f5a09..3ff9eba 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1049,6 +1049,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, struct page **pages, struct vm_area_struct **vmas); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, int force, + struct page **pages, struct vm_area_struct **vmas); struct kvec; int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, struct page **pages); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e25ab6f..1506351 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -841,6 +841,10 @@ static inline int is_normal_idx(enum zone_type idx) return (idx == ZONE_NORMAL); } +static inline int zone_is_movable(struct zone *zone) +{ + return zone_idx(zone) == ZONE_MOVABLE; +} /** * is_highmem - helper function to quickly check if a struct zone is a * highmem zone or not. 
This is an attempt to keep references diff --git a/mm/memory.c b/mm/memory.c index bb1369f..ede53cc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -58,6 +58,8 @@ #include #include #include +#include +#include #include #include @@ -1995,6 +1997,87 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, } EXPORT_SYMBOL(get_user_pages); +#ifdef CONFIG_MEMORY_HOTREMOVE +/** + * It's a wrapper of get_user_pages() but it makes sure that all pages come from + * non-movable zone via additional page migration. It's designed for memory + * hotremove framework. + * + * Currently get_user_pages() always tries to allocate pages from movable zone, + * in some case users of get_user_pages() is easy to pin user pages for a long + * time(for now we found that pages pinned as aio ring pages is such case), + * which is fatal for memory hotremove framework. + * + * This function first calls get_user_pages() to get the candidate pages, and + * then check to ensure all pages are from non movable zone. Otherwise migrate + * them to non movable zone, then retry. It will at most retry once. 
+ */
+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	int ret, i, isolate_err, migrate_pre_flag;
+	LIST_HEAD(pagelist);
+
+retry:
+	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
+				vmas);
+	if (ret <= 0)
+		return ret;
+
+	isolate_err = 0;
+	migrate_pre_flag = 0;
+
+	for (i = 0; i < ret; i++) {
+		if (zone_is_movable(page_zone(pages[i]))) {
+			if (!migrate_pre_flag) {
+				if (migrate_prep())
+					goto release_page;
+				migrate_pre_flag = 1;
+			}
+
+			if (!isolate_lru_page(pages[i])) {
+				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
+						 page_is_file_cache(pages[i]));
+				list_add_tail(&pages[i]->lru, &pagelist);
+			} else {
+				isolate_err = 1;
+				goto release_page;
+			}
+		}
+	}
+
+	/* All pages are non movable, we are done :) */
+	if (i == ret && list_empty(&pagelist))
+		return ret;
+
+release_page:
+	/* Undo the effects of former get_user_pages(), we won't pin anything */
+	release_pages(pages, ret, 1);
+
+	if (migrate_pre_flag && !isolate_err) {
+		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
+					false, MIGRATE_SYNC, MR_SYSCALL);
+		/* Steal pages from non-movable zone successfully?
+		 */
+		if (!ret)
+			goto retry;
+	}
+
+	putback_lru_pages(&pagelist);
+	/* Migration failed, we pin 0 page, tell caller the truth */
+	return 0;
+}
+#else
+inline int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	return get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
+				vmas);
+}
+#endif
+EXPORT_SYMBOL(get_user_pages_non_movable);
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 383bdbb..1b7bd17 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -247,6 +247,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
 	return ret ? 0 : -EBUSY;
 }
 
+/**
+ * @private: 0 means page can be alloced from movable zone, otherwise forbidden
+ */
 struct page *alloc_migrate_target(struct page *page, unsigned long private,
 				  int **resultp)
 {
@@ -254,6 +257,8 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
 
 	if (PageHighMem(page))
 		gfp_mask |= __GFP_HIGHMEM;
+	if (unlikely(private != 0))
+		gfp_mask &= ~__GFP_MOVABLE;
 
 	return alloc_page(gfp_mask);
 }
-- 
1.7.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 5 Feb 2013 12:01:37 +0000 Message-ID: <20130205120137.GG21389@suse.de> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Lin Feng Return-path: Content-Disposition: inline In-Reply-To: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote: > get_user_pages() always tries to allocate pages from movable zone, which is not > reliable to memory hotremove framework in some case. > > This patch introduces a new library function called get_user_pages_non_movable() > to pin pages only from zone non-movable in memory. > It's a wrapper of get_user_pages() but it makes sure that all pages come from > non-movable zone via additional page migration. > > Cc: Andrew Morton > Cc: Mel Gorman > Cc: KAMEZAWA Hiroyuki > Cc: Yasuaki Ishimatsu > Cc: Jeff Moyer > Cc: Minchan Kim > Cc: Zach Brown > Reviewed-by: Tang Chen > Reviewed-by: Gu Zheng > Signed-off-by: Lin Feng I already had started the review of V1 before this was sent unfortunately. However, I think the feedback I gave for V1 is still valid so I'll wait for comments on that review before digging further. 
-- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Minchan Kim Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 6 Feb 2013 09:42:34 +0900 Message-ID: <20130206004234.GD11197@blaptop> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Mel Gorman Return-path: Content-Disposition: inline In-Reply-To: <20130205120137.GG21389@suse.de> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Tue, Feb 05, 2013 at 12:01:37PM +0000, Mel Gorman wrote: > On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote: > > get_user_pages() always tries to allocate pages from movable zone, which is not > > reliable to memory hotremove framework in some case. > > > > This patch introduces a new library function called get_user_pages_non_movable() > > to pin pages only from zone non-movable in memory. > > It's a wrapper of get_user_pages() but it makes sure that all pages come from > > non-movable zone via additional page migration. 
> >
> > Cc: Andrew Morton
> > Cc: Mel Gorman
> > Cc: KAMEZAWA Hiroyuki
> > Cc: Yasuaki Ishimatsu
> > Cc: Jeff Moyer
> > Cc: Minchan Kim
> > Cc: Zach Brown
> > Reviewed-by: Tang Chen
> > Reviewed-by: Gu Zheng
> > Signed-off-by: Lin Feng
>
> I already had started the review of V1 before this was sent
> unfortunately. However, I think the feedback I gave for V1 is still
> valid so I'll wait for comments on that review before digging further.

Mel, Andrew

Sorry for the noise if you have already confirmed the direction, but I have
a concern about it.
IMHO, we can't expect most users of MEMORY_HOTPLUG to release pinned pages
immediately. In addition, MEMORY_HOTPLUG could be used on embedded systems
to reduce power via PASR, and some embedded drivers could use GUP anytime
and anywhere. They can't know in advance whether they will hold pinned pages
for a long time or release them quickly, because that depends on events such
as a user's response, which cannot be predicted.

So to solve it, we could add a WARN_ON in the CMA/MEMORY_HOTPLUG code for
the case where migration fails because of the page count, then investigate
whether the holder is really using GUP and whether GUP is REALLY the
culprit. If so, yell at them "Please use GUP_NM instead"?

Yes, it could be done, but it would be a rather troublesome job. It might
not even be triggered during the QE phase, so the trouble doesn't end until
everyone uses GUP_NM.
Let's consider another case. Some driver pins pages for a very short time,
so its author decides to use GUP instead of GUP_NM. But someday some user
starts to use the driver very often, so although each pin is very short, the
effect could be the same as a permanent pin. In the end, we would have to
change it to GUP_NM again.
IMHO, in the future we will end up changing most GUP users to GUP_NM if CMA
and MEMORY_HOTPLUG become available everywhere.

So, what's wrong if we replace get_user_pages with
get_user_pages_non_movable in MEMORY_HOTPLUG/CMA configurations without
exposing get_user_pages_non_movable?
I mean this

#ifdef CONFIG_MIGRATE_ISOLATE
int get_user_pages()
{
        return __get_user_pages_non_movable();
}
#else
int get_user_pages()
{
        return old_get_user_pages();
}
#endif

IMHO, get_user_pages isn't a performance-sensitive function. If a user were
sensitive about it, he should have tried get_user_pages_fast.

THP degradation from increasing MIGRATE_UNMOVABLE?
Lin said most GUP users release the page shortly, so is it really a problem?
Even in embedded, we don't use THP yet, and CMA and GUP calls would not be
too frequent, but a CMA failure would be critical.

I'd like to hear opinions.

>
> --
> Mel Gorman
> SUSE Labs

-- 
Kind regards,
Minchan Kim

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Benjamin LaHaise
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Date: Tue, 5 Feb 2013 19:52:17 -0500
Message-ID: <20130206005217.GJ20842@kvack.org>
References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Mel Gorman, Lin Feng, akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
To: Minchan Kim
Return-path: Content-Disposition: inline In-Reply-To: <20130206004234.GD11197@blaptop> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Wed, Feb 06, 2013 at 09:42:34AM +0900, Minchan Kim wrote: > THP degradation by increasing MIGRATE_UNMOVABLE? > Lin said most of GUP pages release the page in short so is it really problem? > Even in embedded, we don't use THP yet but CMA and GUP call would be not too often > but failing of CMA would be critical. > > I'd like to hear opinions. If aio was given a callback to migrate the pages on, it could just migrate the pages as needed. There's nothing fundamental preventing that approach. -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 6 Feb 2013 09:56:17 +0000 Message-ID: <20130206095617.GN21389@suse.de> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Minchan Kim Return-path: Content-Disposition: inline In-Reply-To: <20130206004234.GD11197@blaptop> Sender: owner-linux-mm@kvack.org List-Id: 
linux-fsdevel.vger.kernel.org On Wed, Feb 06, 2013 at 09:42:34AM +0900, Minchan Kim wrote: > On Tue, Feb 05, 2013 at 12:01:37PM +0000, Mel Gorman wrote: > > On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote: > > > get_user_pages() always tries to allocate pages from movable zone, which is not > > > reliable to memory hotremove framework in some case. > > > > > > This patch introduces a new library function called get_user_pages_non_movable() > > > to pin pages only from zone non-movable in memory. > > > It's a wrapper of get_user_pages() but it makes sure that all pages come from > > > non-movable zone via additional page migration. > > > > > > Cc: Andrew Morton > > > Cc: Mel Gorman > > > Cc: KAMEZAWA Hiroyuki > > > Cc: Yasuaki Ishimatsu > > > Cc: Jeff Moyer > > > Cc: Minchan Kim > > > Cc: Zach Brown > > > Reviewed-by: Tang Chen > > > Reviewed-by: Gu Zheng > > > Signed-off-by: Lin Feng > > > > I already had started the review of V1 before this was sent > > unfortunately. However, I think the feedback I gave for V1 is still > > valid so I'll wait for comments on that review before digging further. > > Mel, Andrew > > Sorry for making noise if you already confirmed the direction but I have a concern > about that. I haven't confirmed any sort of direction, nor do I determine the direction for memory hot-remove which I'm only paying vague attention to. I stated a while ago that I think the use of ZONE_MOVABLE is a bad idea for "guaranteeing" memory hot-remove and is already going the "wrong" direction. That's just my opinion. This patch is about mitigating (but not solving) the problem of long-lived pins. In the general case, about all I could think of for that is that the kernel would have to warn the administrator what applications had pinned the memory and wait for the user to shut them down. 
To guarantee anything, it would be necessary for subsystems to implement a
callback for migration to unpin pages, barrier operations until migration
completes and pin the new pfns.

> Because IMHO, we can't expect most of user for MEMORY_HOTPLUG will release
> pinned pages immediately.

Indeed not, but it's not really what this patch is about. This patch is
about moving the pages before they get permanently pinned. It mitigates
the problem but does not solve it because there is no guarantee that the
driver pinning a page will flag it properly.

> In addition, MEMORY_HOTPLUG could be used for embedded system
> for reducing power by PASR and some drivers in embedded could use GUP anytime and anywhere.
> They can't know in advance they will use pinned pages long time or release in short time
> because it depends on some event like user's response which is very not predetermined.

True. This patch does not solve that problem.

> So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of
> failing migration by page count and then, investigate they are really using GUP and
> it's REALLY a culprit. If so, yell to them "Please use GUP_NM instead"?

Within the context of this patch, that is their main option. Finding
who is holding the pin is a problem. For userspace-pinned buffers it's
straightforward as rmap will identify what processes are holding the
pin (page->list vmas->mm, lookup all tasks until p->mm == mm) and report
that. For driver-related pins, it's not as straightforward. I guess there
could be a callback to give meaningful information on it but no guarantee
that drivers pinning pages will implement it. In that case all you could do
was dump page->mapping and punt it at a kernel developer to figure out the
responsible driver. This might be manageable for memory hot-remove where
there is an administrator but may not work at all for embedded users.
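The last step of the userspace lookup mentioned above ("lookup all tasks
until p->mm == mm") can be sketched in plain C. This is a userspace model
only, under the assumption that the mm mapping the pinned page has already
been found through rmap; struct sim_mm, struct sim_task and
sim_tasks_using_mm() are hypothetical names, not kernel structures or APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct mm_struct and struct task_struct. */
struct sim_mm {
	int id;
};

struct sim_task {
	const char *comm;		/* task name */
	const struct sim_mm *mm;	/* address space the task runs in */
};

/*
 * Collect every task whose mm matches the one that maps the pinned page.
 * Writes at most out_cap matches into out and returns the total number of
 * matching tasks, which may exceed out_cap.
 */
size_t sim_tasks_using_mm(const struct sim_task *tasks, size_t ntasks,
			  const struct sim_mm *mm,
			  const struct sim_task **out, size_t out_cap)
{
	size_t found = 0;

	assert(tasks != NULL && mm != NULL);

	for (size_t i = 0; i < ntasks; i++) {
		if (tasks[i].mm != mm)
			continue;
		if (found < out_cap)
			out[found] = &tasks[i];
		found++;
	}
	return found;
}
```

Several tasks can share one mm (threads, for instance), which is why the
sketch reports every match instead of stopping at the first one.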
There is the possibility that callbacks could be introduced for
migrate_unpin() and migrate_pin() that take a list of PFN pairs
(old,new). The unpin callback should release the old PFNs and barrier
against any operations until the migrate_pin() callback is called with
the updated pfns to be repinned. Again it would fully depend on subsystems
implementing it properly.

The callback interface would be more robust but puts a lot more work on
the driver side where your mileage will vary.

> Yes. it could be done but it would be rather troublesome job. Even it couldn't be triggered
> during QE phase so that trouble doesn't end until all guys uses GUP_NM.
> Let's consider another case. Some driver pin the page in very short time
> so he decide to use GUP instead of GUP_NM but someday, someuser start to use the driver
> very often so although pinning time is very short, it could be forever pinning effect
> if the use calls it very often. In the end, we should change it with GUP_NM, again.
> IMHO, In future, we ends up changing most of GUP user with GUP_NM if CMA and MEMORY_HOTPLUG
> is available all over the world.
>

Same thing, callbacks to unpin and barrier would handle such a case by
effectively freezing the driver or subsystem responsible for the page.

> So, what's wrong if we replace get_user_pages with get_user_pages_non_movable
> in MEMORY_HOTPLUG/CMA without exposing get_user_pages_non_movable?
>
> I mean this
>
> #ifdef CONFIG_MIGRATE_ISOLATE
> int get_user_pages()
> {
>         return __get_user_pages_non_movable();
> }
> #else
> int get_user_pages()
> {
>         return old_get_user_pages();
> }
> #endif
>

That will migrate everything out of ZONE_MOVABLE every time it's pinned.
One consequence is that direct IO can never use ZONE_MOVABLE on these
systems. It'll create a variation of the lowmem exhaustion problem.

> IMHO, get_user_pages isn't performance sensitive function. If user was sensitive
> about it, he should have tried get_user_pages_fast.
That opens a different can of worms. get_user_pages is part of the
gup_fast slowpath.

> THP degradation by increasing MIGRATE_UNMOVABLE?

The patch should not be converting MIGRATE_MOVABLE requests to
MIGRATE_UNMOVABLE. I covered this in the review of v1.

> Lin said most of GUP pages release the page in short so is it really problem?
> Even in embedded, we don't use THP yet but CMA and GUP call would be not too often
> but failing of CMA would be critical.
>

To guarantee CMA can migrate pages pinned by drivers I think you need
migrate-related callbacks to unpin, barrier the driver until migration
completes and repin.

I do not know of, or at least have not heard of, anyone working on such a
scheme.

-- 
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Minchan Kim
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Date: Fri, 8 Feb 2013 11:32:37 +0900
Message-ID: <20130208023237.GK11197@blaptop>
References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Lin Feng, akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
To: Mel Gorman
Return-path:
Content-Disposition: inline
In-Reply-To: <20130206095617.GN21389@suse.de>
Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Mel, On Wed, Feb 06, 2013 at 09:56:17AM +0000, Mel Gorman wrote: > On Wed, Feb 06, 2013 at 09:42:34AM +0900, Minchan Kim wrote: > > On Tue, Feb 05, 2013 at 12:01:37PM +0000, Mel Gorman wrote: > > > On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote: > > > > get_user_pages() always tries to allocate pages from movable zone, which is not > > > > reliable to memory hotremove framework in some case. > > > > > > > > This patch introduces a new library function called get_user_pages_non_movable() > > > > to pin pages only from zone non-movable in memory. > > > > It's a wrapper of get_user_pages() but it makes sure that all pages come from > > > > non-movable zone via additional page migration. > > > > > > > > Cc: Andrew Morton > > > > Cc: Mel Gorman > > > > Cc: KAMEZAWA Hiroyuki > > > > Cc: Yasuaki Ishimatsu > > > > Cc: Jeff Moyer > > > > Cc: Minchan Kim > > > > Cc: Zach Brown > > > > Reviewed-by: Tang Chen > > > > Reviewed-by: Gu Zheng > > > > Signed-off-by: Lin Feng > > > > > > I already had started the review of V1 before this was sent > > > unfortunately. However, I think the feedback I gave for V1 is still > > > valid so I'll wait for comments on that review before digging further. > > > > Mel, Andrew > > > > Sorry for making noise if you already confirmed the direction but I have a concern > > about that. > > I haven't confirmed any sort of direction, nor do I determine the > direction for memory hot-remove which I'm only paying vague attention to. > I stated a while ago that I think the use of ZONE_MOVABLE is a bad idea > for "guaranteeing" memory hot-remove and is already going the "wrong" > direction. That's just my opinion. > > This patch is about mitigating (but not solving) the problem of long-lived > pins. In the general case, about all I could think of for that is that the Agreed. 
> kernel would have to warn the administrator what applications had pinned
> the memory and wait for the user to shut them down. To guarantee anything,
> it would be necessary for subsystems to implement a callback for migration
> to unpin pages, barrier operations until migration completes and pin the
> new pfns.

It could be applied to a SUBSYSTEM, but it's very hard for every DRIVER
developer, and I doubt we can give them a common template that most driver
developers can reuse.

>
> > Because IMHO, we can't expect most of user for MEMORY_HOTPLUG will release
> > pinned pages immediately.
>
> Indeed not, but it's not really what this patch is about. This patch is
> about moving the pages before they get permanently pinned. It mitigates
> the problem but does not solve it because there is no guarantee that the
> driver pinning a page will flag it properly.

True. And I doubt what memory-hotplug guys really want is best effort, not guarantee.
Anyway, CMA wants a guarantee, and even low latency, and I hope this patch
solves the problem for both memory-hotplug and CMA.

>
> > In addition, MEMORY_HOTPLUG could be used for embedded system
> > for reducing power by PASR and some drivers in embedded could use GUP anytime and anywhere.
> > They can't know in advance they will use pinned pages long time or release in short time
> > because it depends on some event like user's response which is very not predetermined.
>
> True. This patch does not solve that problem.
>
> > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of
> > failing migration by page count and then, investigate they are really using GUP and
> > it's REALLY a culprit. If so, yell to them "Please use GUP_NM instead"?
>
> Within the context of this patch, that is their main option. Finding
> who is holding the pin is a problem.
For userspace-pinned buffers it's > straight-forward as rmap will identify what processes are holding the > pin (page->list vmas->mm, lookup all tasks until p->mm == mm) and report > that. For driver-related pins, it's not as straight-forward. I guess there True. > could be callback to give meaningful information on it but no guarantee > that drivers pinning pages will implement it. In that case all you could do Nod. > was dump page->mapping and punt it at a kernel developer to figure out the > responsible driver. This might be managable for memory hot-remove where > there is an administator but may not work at all for embedded users. Yeah. There are even proprietary modules in embedded systems, where we can't see the source code. > > There is the possibility that callbacks could be introduced for > migrate_unpin() and migrate_pin() that takes a list of PFN pairs > (old,new). The unpin callback should release the old PFNs and barrier > against any operations until the migrate_pfn() callback is called with > the updated pfns to be repinned. Again it would fully depend on subsystems > implementing it properly. > > The callback interface would be more robust but puts a lot more work on > the driver side where your milage will vary. True. > > > Yes. it could be done but it would be rather trobulesome job. Even it couldn't be triggered > > during QE phase so that trouble doesn't end until all guys uses GUP_NM. > > Let's consider another case. Some driver pin the page in very short time > > so he decide to use GUP instead of GUP_NM but someday, someuser start to use the driver > > very often so although pinning time is very short, it could be forever pinning effect > > if the use calls it very often. In the end, we should change it with GUP_NM, again. > > IMHO, In future, we ends up changing most of GUP user with GUP_NM if CMA and MEMORY_HOTPLUG > > is available all over the world.
> > > > Same thing, callbacks to unpin and barrier would handle such a case by > effectively freezing the driver or subsystem responsible for the page. > > > So, what's wrong if we replace get_user_pages with get_user_pages_non_movable > > in MEMORY_HOTPLUG/CMA without exposing get_user_pages_non_movable? > > > > I mean this > > > > #ifdef CONFIG_MIGRATE_ISOLATE > > int get_user_pages() > > { > > return __get_user_pages_non_movable(); > > } > > #else > > int get_user_pages() > > { > > return old_get_user_pages(); > > } > > #endif > > > > That will migrate everything out of ZONE_MOVABLE every time it's pinned. > One consequence is that direct IO can never use ZONE_MOVABLE on these > systems. It'll create a variation of the lowmem exhaustion problem. For example, suppose there is a 4G highmem zone and half of it is the movable zone. In that case, we can use the other 2G of highmem instead of lowmem. But I agree it could end up pinning many lowmem pages, so the problem could still happen. IMHO, isn't that the trade-off for using MEMORY-HOTPLUG/CMA? > > > IMHO, get_user_pages isn't performance sensitive function. If user was sensitive > > about it, he should have tried get_user_pages_fast. > > That opens a different cans of works. get_user_pages is part of the > gup_fast slowpath. > > > THP degradation by increasing MIGRATE_UNMOVABLE? > > The patch should not be converting MIGRATE_MOVABLE requests to > MIGRATE_UNMOVABLE. I covered this in the review of v1. I guess the memory-hotplug guys want to use GUP_NM for users that pin pages for a long time. So doesn't it make sense to migrate the page into MIGRATE_UNMOVABLE? But I'm not sure about GUP_NM's semantics. > > > Lin said most of GUP pages release the page in short so is it really problem? > > Even in embedded, we don't use THP yet but CMA and GUP call would be not too often > > but failing of CMA would be critical.
> > > > To guarantee CMA can migrate pages pinned by drivers I think you need > migrate-related callsbacks to unpin, barrier the driver until migration > completes and repin. I agree it's an ideal solution for the future, but as you already mentioned, it's not easy for all drivers. In fact, I don't want to insist on my opinion for CMA because I guess the CMA design was not good from the beginning. I just posted my concern and wanted to discuss how to solve the problem, but if there is no simple solution now, let me leave the decision to the maintainers. Thanks for sharing your opinion, Mel! > > I do not know, or at least have no heard, of anyone working on such a > scheme. > > -- > Mel Gorman > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org.
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wanpeng Li Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 20 Feb 2013 19:37:57 +0800 Message-ID: <19348.4896830798$1361360320@news.gmane.org> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> Reply-To: Wanpeng Li Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng To: Lin Feng Return-path: Content-Disposition: inline In-Reply-To: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote: >get_user_pages() always tries to allocate pages from movable zone, which is not > reliable to memory hotremove framework in some case. > >This patch introduces a new library function called get_user_pages_non_movable() > to pin pages only from zone non-movable in memory. >It's a wrapper of get_user_pages() but it makes sure that all pages come from >non-movable zone via additional page migration. 
> >Cc: Andrew Morton >Cc: Mel Gorman >Cc: KAMEZAWA Hiroyuki >Cc: Yasuaki Ishimatsu >Cc: Jeff Moyer >Cc: Minchan Kim >Cc: Zach Brown >Reviewed-by: Tang Chen >Reviewed-by: Gu Zheng >Signed-off-by: Lin Feng >--- > include/linux/mm.h | 3 ++ > include/linux/mmzone.h | 4 ++ > mm/memory.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++ > mm/page_isolation.c | 5 +++ > 4 files changed, 95 insertions(+), 0 deletions(-) > >diff --git a/include/linux/mm.h b/include/linux/mm.h >index 12f5a09..3ff9eba 100644 >--- a/include/linux/mm.h >+++ b/include/linux/mm.h >@@ -1049,6 +1049,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > struct page **pages, struct vm_area_struct **vmas); > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > struct page **pages); >+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, >+ unsigned long start, int nr_pages, int write, int force, >+ struct page **pages, struct vm_area_struct **vmas); > struct kvec; > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > struct page **pages); >diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >index e25ab6f..1506351 100644 >--- a/include/linux/mmzone.h >+++ b/include/linux/mmzone.h >@@ -841,6 +841,10 @@ static inline int is_normal_idx(enum zone_type idx) > return (idx == ZONE_NORMAL); > } > >+static inline int zone_is_movable(struct zone *zone) >+{ >+ return zone_idx(zone) == ZONE_MOVABLE; >+} > /** > * is_highmem - helper function to quickly check if a struct zone is a > * highmem zone or not. 
This is an attempt to keep references >diff --git a/mm/memory.c b/mm/memory.c >index bb1369f..ede53cc 100644 >--- a/mm/memory.c >+++ b/mm/memory.c >@@ -58,6 +58,8 @@ > #include > #include > #include >+#include >+#include > #include > > #include >@@ -1995,6 +1997,87 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > } > EXPORT_SYMBOL(get_user_pages); > >+#ifdef CONFIG_MEMORY_HOTREMOVE >+/** >+ * It's a wrapper of get_user_pages() but it makes sure that all pages come from >+ * non-movable zone via additional page migration. It's designed for memory >+ * hotremove framework. >+ * >+ * Currently get_user_pages() always tries to allocate pages from movable zone, >+ * in some case users of get_user_pages() is easy to pin user pages for a long >+ * time(for now we found that pages pinned as aio ring pages is such case), >+ * which is fatal for memory hotremove framework. >+ * >+ * This function first calls get_user_pages() to get the candidate pages, and >+ * then check to ensure all pages are from non movable zone. Otherwise migrate How about "Otherwise migrate candidate pages which have already been isolated to non movable zone."? >+ * them to non movable zone, then retry. It will at most retry once. 
>+ */ >+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, >+ unsigned long start, int nr_pages, int write, int force, >+ struct page **pages, struct vm_area_struct **vmas) >+{ >+ int ret, i, isolate_err, migrate_pre_flag; >+ LIST_HEAD(pagelist); >+ >+retry: >+ ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, >+ vmas); >+ if (ret <= 0) >+ return ret; >+ >+ isolate_err = 0; >+ migrate_pre_flag = 0; >+ >+ for (i = 0; i < ret; i++) { >+ if (zone_is_movable(page_zone(pages[i]))) { >+ if (!migrate_pre_flag) { >+ if (migrate_prep()) >+ goto release_page; >+ migrate_pre_flag = 1; >+ } >+ >+ if (!isolate_lru_page(pages[i])) { >+ inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >+ page_is_file_cache(pages[i])); >+ list_add_tail(&pages[i]->lru, &pagelist); >+ } else { >+ isolate_err = 1; >+ goto release_page; >+ } >+ } >+ } >+ >+ /* All pages are non movable, we are done :) */ >+ if (i == ret && list_empty(&pagelist)) >+ return ret; >+ >+release_page: >+ /* Undo the effects of former get_user_pages(), we won't pin anything */ >+ release_pages(pages, ret, 1); >+ >+ if (migrate_pre_flag && !isolate_err) { >+ ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >+ false, MIGRATE_SYNC, MR_SYSCALL); >+ /* Steal pages from non-movable zone successfully? 
*/ >+ if (!ret) >+ goto retry; >+ } >+ >+ putback_lru_pages(&pagelist); >+ /* Migration failed, we pin 0 page, tell caller the truth */ >+ return 0; >+} >+#else >+inline int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, >+ unsigned long start, int nr_pages, int write, int force, >+ struct page **pages, struct vm_area_struct **vmas) >+{ >+ return get_user_pages(tsk, mm, start, nr_pages, write, force, pages, >+ vmas); >+} >+#endif >+EXPORT_SYMBOL(get_user_pages_non_movable); >+ > /** > * get_dump_page() - pin user page in memory while writing it to core dump > * @addr: user address >diff --git a/mm/page_isolation.c b/mm/page_isolation.c >index 383bdbb..1b7bd17 100644 >--- a/mm/page_isolation.c >+++ b/mm/page_isolation.c >@@ -247,6 +247,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, > return ret ? 0 : -EBUSY; > } > >+/** >+ * @private: 0 means page can be alloced from movable zone, otherwise forbidden >+ */ > struct page *alloc_migrate_target(struct page *page, unsigned long private, > int **resultp) > { >@@ -254,6 +257,8 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private, > > if (PageHighMem(page)) > gfp_mask |= __GFP_HIGHMEM; >+ if (unlikely(private != 0)) >+ gfp_mask &= ~__GFP_MOVABLE; > > return alloc_page(gfp_mask); > } >-- >1.7.1 > >-- >To unsubscribe, send a message with 'unsubscribe linux-mm' in >the body to majordomo@kvack.org. For more info on Linux MM, >see: http://www.linux-mm.org/ . >Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org.
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Lin Feng Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 20 Feb 2013 20:39:02 +0800 Message-ID: <5124C3E6.1060108@cn.fujitsu.com> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130220113757.GA10124@hacker.(null)> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org To: Wanpeng Li Return-path: In-Reply-To: <20130220113757.GA10124@hacker.(null)> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Wanpeng, On 02/20/2013 07:37 PM, Wanpeng Li wrote: >> + * This function first calls get_user_pages() to get the candidate pages, and >> >+ * then check to ensure all pages are from non movable zone. Otherwise migrate > How about "Otherwise migrate candidate pages which have already been > isolated to non movable zone."? > That is exactly what the code does, so I feel it is too detailed to be appropriate for the comment :( Do we have to comment it in that much detail? thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org.
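As a rough illustration of the control flow that the function comment under discussion describes (pin, check the zones, release and migrate, then retry), here is a small userspace model in C. Nothing in it is kernel API: struct model_page, model_pin(), model_unpin(), model_migrate() and model_pin_non_movable() are hypothetical stand-ins, and isolation failure and migration failure are folded into one failure path for brevity. It is a sketch of the idea only, not the patch's implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel objects; none of this is kernel API. */
enum zone_kind { ZONE_KIND_NORMAL, ZONE_KIND_MOVABLE };

struct model_page {
    enum zone_kind zone; /* zone the page currently lives in */
    int pin_count;       /* outstanding pins on the page */
    bool migratable;     /* false models a page that cannot be isolated or migrated */
};

/* Stands in for get_user_pages(): pin every page, return how many were pinned. */
static int model_pin(struct model_page *pages, int nr)
{
    for (int i = 0; i < nr; i++)
        pages[i].pin_count++;
    return nr;
}

/* Stands in for release_pages(): drop the pins taken by model_pin(). */
static void model_unpin(struct model_page *pages, int nr)
{
    for (int i = 0; i < nr; i++)
        pages[i].pin_count--;
}

/* Move every movable-zone page out of the movable zone; 0 on success, -1 on failure. */
static int model_migrate(struct model_page *pages, int nr)
{
    /* Check first so a failure leaves no page half-handled. */
    for (int i = 0; i < nr; i++)
        if (pages[i].zone == ZONE_KIND_MOVABLE && !pages[i].migratable)
            return -1;
    for (int i = 0; i < nr; i++)
        if (pages[i].zone == ZONE_KIND_MOVABLE)
            pages[i].zone = ZONE_KIND_NORMAL;
    return 0;
}

/* Returns the number of pages pinned, or 0 if nothing could be pinned safely. */
int model_pin_non_movable(struct model_page *pages, int nr)
{
    for (;;) {
        int pinned = model_pin(pages, nr);
        bool any_movable = false;

        for (int i = 0; i < pinned; i++)
            if (pages[i].zone == ZONE_KIND_MOVABLE)
                any_movable = true;

        /* All pages already sit outside the movable zone: keep the pins. */
        if (!any_movable)
            return pinned;

        /* Undo the pins before migrating, so nothing stays pinned on failure. */
        model_unpin(pages, pinned);
        if (model_migrate(pages, pinned) != 0)
            return 0;
        /* Migration succeeded: loop and pin the pages at their new location. */
    }
}
```

In this model a successful migration leaves no page in the movable zone, so the loop runs at most twice. Whether the real function can retry more often depends on what get_user_pages() returns on the second pass, which the model does not try to capture.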
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tang Chen Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 13 May 2013 17:11:43 +0800 Message-ID: <5190AE4F.4000103@cn.fujitsu.com> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Cc: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Mel Gorman Return-path: In-Reply-To: <20130206095617.GN21389@suse.de> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Mel, On 02/06/2013 05:56 PM, Mel Gorman wrote: > > There is the possibility that callbacks could be introduced for > migrate_unpin() and migrate_pin() that takes a list of PFN pairs > (old,new). The unpin callback should release the old PFNs and barrier > against any operations until the migrate_pfn() callback is called with > the updated pfns to be repinned. Again it would fully depend on subsystems > implementing it properly. > > The callback interface would be more robust but puts a lot more work on > the driver side where your milage will vary. > I'm very interested in the "callback" way you said. 
For memory hot-remove case, the aio pages are pined in memory and making the pages cannot be offlined, furthermore, the pages cannot be removed. IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio subsystem, and call them when hot-remove code tries to offline pages, right ? If so, I'm wondering where should we put this callback pointers ? In struct page ? It has been a long time since this topic was discussed. But to solve this problem cleanly for hotplug guys and CMA guys, please give some more comments. Thanks. :) > > To guarantee CMA can migrate pages pinned by drivers I think you need > migrate-related callsbacks to unpin, barrier the driver until migration > completes and repin. > > I do not know, or at least have no heard, of anyone working on such a > scheme. > -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 13 May 2013 10:19:02 +0100 Message-ID: <20130513091902.GP11497@suse.de> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, 
linux-kernel@vger.kernel.org, Marek Szyprowski To: Tang Chen Return-path: Content-Disposition: inline In-Reply-To: <5190AE4F.4000103@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, May 13, 2013 at 05:11:43PM +0800, Tang Chen wrote: > Hi Mel, > > On 02/06/2013 05:56 PM, Mel Gorman wrote: > > > >There is the possibility that callbacks could be introduced for > >migrate_unpin() and migrate_pin() that takes a list of PFN pairs > >(old,new). The unpin callback should release the old PFNs and barrier > >against any operations until the migrate_pfn() callback is called with > >the updated pfns to be repinned. Again it would fully depend on subsystems > >implementing it properly. > > > >The callback interface would be more robust but puts a lot more work on > >the driver side where your milage will vary. > > > > I'm very interested in the "callback" way you said. > > For memory hot-remove case, the aio pages are pined in memory and making > the pages cannot be offlined, furthermore, the pages cannot be removed. > > IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio > subsystem, and call them when hot-remove code tries to offline > pages, right ? > > If so, I'm wondering where should we put this callback pointers ? > In struct page ? > No, I would expect the callbacks to be part the address space operations which can be found via page->mapping. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Benjamin LaHaise Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 13 May 2013 10:37:57 -0400 Message-ID: <20130513143757.GP31899@kvack.org> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Tang Chen , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Mel Gorman Return-path: Content-Disposition: inline In-Reply-To: <20130513091902.GP11497@suse.de> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, May 13, 2013 at 10:19:02AM +0100, Mel Gorman wrote: > On Mon, May 13, 2013 at 05:11:43PM +0800, Tang Chen wrote: ... > > If so, I'm wondering where should we put this callback pointers ? > > In struct page ? > > > > No, I would expect the callbacks to be part the address space operations > which can be found via page->mapping. If someone adds those callbacks and provides a means for testing them, it would be pretty trivial to change the aio code to migrate its pinned pages on demand. -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff Moyer Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 13 May 2013 10:54:03 -0400 Message-ID: References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Mel Gorman , Tang Chen , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Benjamin LaHaise Return-path: In-Reply-To: <20130513143757.GP31899@kvack.org> (Benjamin LaHaise's message of "Mon, 13 May 2013 10:37:57 -0400") Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Benjamin LaHaise writes: > On Mon, May 13, 2013 at 10:19:02AM +0100, Mel Gorman wrote: >> On Mon, May 13, 2013 at 05:11:43PM +0800, Tang Chen wrote: > ... >> > If so, I'm wondering where should we put this callback pointers ? >> > In struct page ? >> > >> >> No, I would expect the callbacks to be part the address space operations >> which can be found via page->mapping. > > If someone adds those callbacks and provides a means for testing them, > it would be pretty trivial to change the aio code to migrate its pinned > pages on demand. How do you propose to move the ring pages? 
Cheers, Jeff -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Benjamin LaHaise Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Mon, 13 May 2013 11:01:47 -0400 Message-ID: <20130513150147.GQ31899@kvack.org> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Mel Gorman , Tang Chen , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Jeff Moyer Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: > How do you propose to move the ring pages? It's the same problem as doing a TLB shootdown: flush the old pages from userspace's mapping, copy any existing data to the new pages, then repopulate the page tables. It will likely require the addition of address_space_operations for the mapping, but that's not too hard to do. -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tang Chen Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 14 May 2013 09:24:58 +0800 Message-ID: <5191926A.2090608@cn.fujitsu.com> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> <20130513150147.GQ31899@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Benjamin LaHaise , Jeff Moyer , Mel Gorman Return-path: In-Reply-To: <20130513150147.GQ31899@kvack.org> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Mel, Benjamin, Jeff, On 05/13/2013 11:01 PM, Benjamin LaHaise wrote: > On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: >> How do you propose to move the ring pages? > > It's the same problem as doing a TLB shootdown: flush the old pages from > userspace's mapping, copy any existing data to the new pages, then > repopulate the page tables. It will likely require the addition of > address_space_operations for the mapping, but that's not too hard to do. 
> I think we add migrate_unpin() callback to decrease page->count if necessary, and migrate the page to a new page, and add migrate_pin() callback to pin the new page again. The migrate procedure will work just as before. We use callbacks to decrease the page->count before migration starts, and increase it when the migration is done. And migrate_pin() and migrate_unpin() callbacks will be added to struct address_space_operations. Is that right ? If so, I'll be working on it. Thanks. :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tang Chen Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 14 May 2013 11:55:31 +0800 Message-ID: <5191B5B3.7080406@cn.fujitsu.com> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Cc: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Mel Gorman Return-path: In-Reply-To: <20130513091902.GP11497@suse.de> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Mel, On 05/13/2013 05:19 PM, Mel Gorman 
wrote:
>> For memory hot-remove case, the aio pages are pined in memory and making
>> the pages cannot be offlined, furthermore, the pages cannot be removed.
>>
>> IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio
>> subsystem, and call them when hot-remove code tries to offline
>> pages, right ?
>>
>> If so, I'm wondering where should we put this callback pointers ?
>> In struct page ?
>>
>
> No, I would expect the callbacks to be part the address space operations
> which can be found via page->mapping.
>

There are two more problems I don't quite understand.

1. An anonymous page has no address_space, and so no address space
   operations. But the aio ring problem arose precisely when dealing with
   anonymous pages. Please refer to:
   (https://lkml.org/lkml/2012/11/29/69)

   If we put the callbacks in page->mapping->a_ops, anonymous pages
   won't be able to use them. And we cannot provide a default callback,
   because the situation we are dealing with is a special case.

   So where should the callbacks for anonymous pages go ?

2. How do we find out why page->count != 1 in
   migrate_page_move_mapping() ?

   In the problem we are dealing with, get_user_pages() is called to pin
   the pages in memory. The pages are migratable, so we want to decrease
   page->count.

   But get_user_pages() is not the only thing that can raise page->count.
   How can I know when I should decrease page->count and when I should not ?

   The only way I can think of is to assign the callback pointer in
   get_user_pages(), because it is get_user_pages() that pins the pages.

Thanks. :)

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org.
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Benjamin LaHaise Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 14 May 2013 09:58:50 -0400 Message-ID: <20130514135850.GG13845@kvack.org> References: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> <20130513150147.GQ31899@kvack.org> <5191926A.2090608@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Jeff Moyer , Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Tang Chen Return-path: Content-Disposition: inline In-Reply-To: <5191926A.2090608@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote: > Hi Mel, Benjamin, Jeff, > > On 05/13/2013 11:01 PM, Benjamin LaHaise wrote: > >On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: > >>How do you propose to move the ring pages? > > > >It's the same problem as doing a TLB shootdown: flush the old pages from > >userspace's mapping, copy any existing data to the new pages, then > >repopulate the page tables. It will likely require the addition of > >address_space_operations for the mapping, but that's not too hard to do. 
> > > > I think we add migrate_unpin() callback to decrease page->count if > necessary, > and migrate the page to a new page, and add migrate_pin() callback to pin > the new page again. You can't just decrease the page count for this to work. The pages are pinned because aio_complete() can occur at any time and needs to have a place to write the completion events. When changing pages, aio has to take the appropriate lock when changing one page for another. > The migrate procedure will work just as before. We use callbacks to > decrease > the page->count before migration starts, and increase it when the migration > is done. > > And migrate_pin() and migrate_unpin() callbacks will be added to > struct address_space_operations. I think the existing migratepage operation in address_space_operations can be used. Does it get called when hot unplug occurs? That is: is testing with the migrate_pages syscall similar enough to the memory removal case? -ben > Is that right ? > > If so, I'll be working on it. > > Thanks. :) -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: chen tang Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 14 May 2013 23:16:41 +0800 Message-ID: References: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> <20130513150147.GQ31899@kvack.org> <5191926A.2090608@cn.fujitsu.com> <20130514135850.GG13845@kvack.org> Mime-Version: 1.0 Content-Type: multipart/alternative; boundary=e89a8fb202c2d5c37704dcaf1e5d Cc: Tang Chen , Jeff Moyer , Mel Gorman , Minchan Kim , Lin Feng , Andrew Morton , viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, Yasuaki Ishimatsu , wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, Linux Kernel Mailing List , Marek Szyprowski To: Benjamin LaHaise Return-path: In-Reply-To: <20130514135850.GG13845@kvack.org> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org --e89a8fb202c2d5c37704dcaf1e5d Content-Type: text/plain; charset=ISO-8859-1 Hi Benjamin, Thank you for the explaination. But would you please give me more info about aio ? See below. 2013/5/14 Benjamin LaHaise > On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote: > > Hi Mel, Benjamin, Jeff, > > > > On 05/13/2013 11:01 PM, Benjamin LaHaise wrote: > > >On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: > > >>How do you propose to move the ring pages? 
> > >
> > >It's the same problem as doing a TLB shootdown: flush the old pages from
> > >userspace's mapping, copy any existing data to the new pages, then
> > >repopulate the page tables. It will likely require the addition of
> > >address_space_operations for the mapping, but that's not too hard to do.
> > >
> >
> > I think we add migrate_unpin() callback to decrease page->count if
> > necessary,
> > and migrate the page to a new page, and add migrate_pin() callback to pin
> > the new page again.
>
> You can't just decrease the page count for this to work. The pages are
> pinned because aio_complete() can occur at any time and needs to have a
> place to write the completion events. When changing pages, aio has to
> take the appropriate lock when changing one page for another.
>

I see that aio_complete() holds kioctx->ctx_lock. Can we hold this lock
while we migrate the aio ring pages ?

>
> > The migrate procedure will work just as before. We use callbacks to
> > decrease
> > the page->count before migration starts, and increase it when the
> > migration is done.
> >
> > And migrate_pin() and migrate_unpin() callbacks will be added to
> > struct address_space_operations.
>
> I think the existing migratepage operation in address_space_operations can
> be used. Does it get called when hot unplug occurs? That is: is testing
> with the migrate_pages syscall similar enough to the memory removal case?
>

Anonymous pages don't have an address_space, so they don't have
address_space_operations. And aio ring pages are anonymous pages, right ?

In move_to_new_page(), the kernel decides which function to call:

	if (!mapping)
		rc = migrate_page(mapping, newpage, page, mode);
	else if (mapping->a_ops->migratepage)
		rc = mapping->a_ops->migratepage(mapping,
						newpage, page, mode);
	else
		rc = fallback_migrate_page(mapping, newpage, page, mode);

So for aio ring pages, it always calls migrate_page(), right ?

Thanks. :)

>
> -ben
>
> > Is that right ?
> > > > If so, I'll be working on it. > > > > Thanks. :) > > -- > "Thought is the essence of where you are now." > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > --e89a8fb202c2d5c37704dcaf1e5d Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
--e89a8fb202c2d5c37704dcaf1e5d-- -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tang Chen Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 15 May 2013 10:09:04 +0800 Message-ID: <5192EE40.7060407@cn.fujitsu.com> References: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> <20130513150147.GQ31899@kvack.org> <5191926A.2090608@cn.fujitsu.com> <20130514135850.GG13845@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Jeff Moyer , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Benjamin LaHaise , Mel Gorman Return-path: In-Reply-To: <20130514135850.GG13845@kvack.org> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Benjamin, Mel, Please see below. On 05/14/2013 09:58 PM, Benjamin LaHaise wrote: > On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote: >> Hi Mel, Benjamin, Jeff, >> >> On 05/13/2013 11:01 PM, Benjamin LaHaise wrote: >>> On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: >>>> How do you propose to move the ring pages? 
>>> >>> It's the same problem as doing a TLB shootdown: flush the old pages from >>> userspace's mapping, copy any existing data to the new pages, then >>> repopulate the page tables. It will likely require the addition of >>> address_space_operations for the mapping, but that's not too hard to do. >>> >> >> I think we add migrate_unpin() callback to decrease page->count if >> necessary, >> and migrate the page to a new page, and add migrate_pin() callback to pin >> the new page again. > > You can't just decrease the page count for this to work. The pages are > pinned because aio_complete() can occur at any time and needs to have a > place to write the completion events. When changing pages, aio has to > take the appropriate lock when changing one page for another. In aio_complete(), aio_complete() { ...... spin_lock_irqsave(&ctx->completion_lock, flags); //write the completion event. spin_unlock_irqrestore(&ctx->completion_lock, flags); ...... } So for this problem, I think we can hold ctx->completion_lock in the aio callbacks to prevent aio subsystem accessing pages who are being migrated. > >> The migrate procedure will work just as before. We use callbacks to >> decrease >> the page->count before migration starts, and increase it when the migration >> is done. >> >> And migrate_pin() and migrate_unpin() callbacks will be added to >> struct address_space_operations. > > I think the existing migratepage operation in address_space_operations can > be used. Does it get called when hot unplug occurs? That is: is testing > with the migrate_pages syscall similar enough to the memory removal case? > But as I said, for anonymous pages such as aio ring buffer, they don't have address_space_operations. So where should we put the callbacks' pointers ? Add something like address_space_operations to struct anon_vma ? Thanks. :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tang Chen Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 15 May 2013 15:21:34 +0800 Message-ID: <5193377E.30102@cn.fujitsu.com> References: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> <20130513150147.GQ31899@kvack.org> <5191926A.2090608@cn.fujitsu.com> <20130514135850.GG13845@kvack.org> <5192EE40.7060407@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Jeff Moyer , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Benjamin LaHaise , Mel Gorman Return-path: In-Reply-To: <5192EE40.7060407@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Benjamin, Mel, On 05/15/2013 10:09 AM, Tang Chen wrote: > Hi Benjamin, Mel, > > Please see below. > > On 05/14/2013 09:58 PM, Benjamin LaHaise wrote: >> On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote: >>> Hi Mel, Benjamin, Jeff, >>> >>> On 05/13/2013 11:01 PM, Benjamin LaHaise wrote: >>>> On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: >>>>> How do you propose to move the ring pages? 
>>>> >>>> It's the same problem as doing a TLB shootdown: flush the old pages >>>> from >>>> userspace's mapping, copy any existing data to the new pages, then >>>> repopulate the page tables. It will likely require the addition of >>>> address_space_operations for the mapping, but that's not too hard to >>>> do. >>>> >>> >>> I think we add migrate_unpin() callback to decrease page->count if >>> necessary, >>> and migrate the page to a new page, and add migrate_pin() callback to >>> pin >>> the new page again. >> >> You can't just decrease the page count for this to work. The pages are >> pinned because aio_complete() can occur at any time and needs to have a >> place to write the completion events. When changing pages, aio has to >> take the appropriate lock when changing one page for another. > > In aio_complete(), > > aio_complete() { > ...... > spin_lock_irqsave(&ctx->completion_lock, flags); > //write the completion event. > spin_unlock_irqrestore(&ctx->completion_lock, flags); > ...... > } > > So for this problem, I think we can hold kioctx->completion_lock in the aio > callbacks to prevent aio subsystem accessing pages who are being migrated. > Another problem here is: We intend to call these callbacks in the page migrate path, and we need to know which lock to hold. But there is no way for migrate path to know this info. The migrate path is common for all kinds of pages, so we cannot pass any specific parameter to the callbacks in migrate path. When we get a page, we cannot get any kioctx info from the page. So how can the callback know which lock to require without any parameter ? Or do we have any other way to do so ? Would you please give some more advice about this ? BTW, we also need to update kioctx->ring_pages. Thanks. :) >> >>> The migrate procedure will work just as before. We use callbacks to >>> decrease >>> the page->count before migration starts, and increase it when the >>> migration >>> is done. 
>>> >>> And migrate_pin() and migrate_unpin() callbacks will be added to >>> struct address_space_operations. >> >> I think the existing migratepage operation in address_space_operations >> can >> be used. Does it get called when hot unplug occurs? That is: is testing >> with the migrate_pages syscall similar enough to the memory removal case? >> > > But as I said, for anonymous pages such as aio ring buffer, they don't have > address_space_operations. So where should we put the callbacks' pointers ? > > Add something like address_space_operations to struct anon_vma ? > > Thanks. :) > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mel Gorman Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Wed, 15 May 2013 14:24:53 +0100 Message-ID: <20130515132453.GB11497@suse.de> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Cc: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, 
jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Tang Chen Return-path: Received: from cantor2.suse.de ([195.135.220.15]:58918 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932473Ab3EONZB (ORCPT ); Wed, 15 May 2013 09:25:01 -0400 Content-Disposition: inline In-Reply-To: <5191B5B3.7080406@cn.fujitsu.com> Sender: linux-fsdevel-owner@vger.kernel.org List-ID:

On Tue, May 14, 2013 at 11:55:31AM +0800, Tang Chen wrote:
> Hi Mel,
>
> On 05/13/2013 05:19 PM, Mel Gorman wrote:
> >>For memory hot-remove case, the aio pages are pined in memory and making
> >>the pages cannot be offlined, furthermore, the pages cannot be removed.
> >>
> >>IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio
> >>subsystem, and call them when hot-remove code tries to offline
> >>pages, right ?
> >>
> >>If so, I'm wondering where should we put this callback pointers ?
> >>In struct page ?
> >>
> >
> >No, I would expect the callbacks to be part the address space operations
> >which can be found via page->mapping.
> >
>
> Two more problems I don't quite understand.
>

Bear in mind I've done no research on this particular problem. At best,
the migrate pin/unpin is the direction that I'd start with if I was tasked
with fixing this (which I'm not). Hence, I cannot answer your questions at
the level of detail you are looking for.

> 1. For an anonymous page, it has no address_space, and no address space
> operation. But the aio ring problem just happened when dealing with
> anonymous pages. Please refer to:
> (https://lkml.org/lkml/2012/11/29/69)
>

If it is to be an address space operations structure then you'll need a
pseudo mapping structure for anonymous pages that are pinned by aio --
similar in principle to how swapper_space is used for managing PageSwapCache
or how anon_vma structures can be associated with a page.
However, I warn you that you may find that the address_space is the
wrong level at which to register such callbacks; it just seemed like the
obvious first choice. A potential alternative implementation is to create
a 1:1 association between pages and a long-lived holder that is stored in
a hash table (a similar style of arrangement to page_waitqueue). A page is
looked up in the hash table and, if an entry exists, it points to a
callback structure for the subsystem holding the pin. It's up to the
subsystem to register the callbacks when it is about to pin a page
(get_user_pages_longlived(...., &release_ops)) and to figure out how to
release the pin safely.

> 2. How to find out the reason why page->count != 1 in
> migrate_page_move_mapping() ?
>
> In the problem we are dealing with, get_user_pages() is called to
> pin the pages in memory. And the pages are migratable. So we want to
> decrease the page->count.
>
> But get_user_pages() is not the only reason leading to
> page->count increased.
> How can I know when should decrease the page->count or when should not ?
>

You cannot just arbitrarily drop the page->count without causing problems.
It has to be released by the subsystem holding the pin because only it
can know when it's safe.
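
The hash-table arrangement described above could be sketched roughly as
follows. This is a userspace illustration only, under stated assumptions:
every name in it (pin_ops, pin_holder, pin_register, pin_lookup,
pin_try_release, demo_release) is hypothetical, a plain void pointer stands
in for struct page, and no locking is shown. It is not kernel API and not
the approach taken by any posted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callbacks a pinning subsystem would register. */
struct pin_ops {
	/* Asked before migration; returns 0 once the pin is released. */
	int (*release)(void *page, void *owner);
};

/* One long-lived holder per pinned page (the 1:1 association). */
struct pin_holder {
	void *page;			/* stand-in for struct page * */
	void *owner;			/* subsystem context, e.g. an aio ring */
	const struct pin_ops *ops;
	struct pin_holder *next;	/* hash chain */
};

#define PIN_HASH_SIZE 64u		/* arbitrary power of two */

static struct pin_holder *pin_table[PIN_HASH_SIZE];

static unsigned pin_hash(const void *page)
{
	/* Drop low (alignment) bits, then fold into the table range. */
	return (unsigned)(((uintptr_t)page >> 4) & (PIN_HASH_SIZE - 1));
}

/* Called by the subsystem when it is about to take a long-lived pin. */
static void pin_register(struct pin_holder *h)
{
	unsigned b;

	assert(h && h->page && h->ops && h->ops->release);
	b = pin_hash(h->page);
	h->next = pin_table[b];
	pin_table[b] = h;
}

static struct pin_holder *pin_lookup(const void *page)
{
	struct pin_holder *h = pin_table[pin_hash(page)];

	while (h && h->page != page)
		h = h->next;
	return h;
}

static void pin_unregister(struct pin_holder *h)
{
	struct pin_holder **pp = &pin_table[pin_hash(h->page)];

	while (*pp && *pp != h)
		pp = &(*pp)->next;
	if (*pp)
		*pp = h->next;
}

/* Migration side: ask the holder, if any, to let go of the page. */
static int pin_try_release(void *page)
{
	struct pin_holder *h = pin_lookup(page);

	if (!h)
		return 0;	/* no long-lived pin registered */
	return h->ops->release(page, h->owner);
}

/* Example callback: count the request and report the pin as released. */
static int demo_release(void *page, void *owner)
{
	(void)page;
	++*(int *)owner;
	return 0;
}

static const struct pin_ops demo_ops = { demo_release };
```

The point of the indirection is that the migration path never touches the
reference count itself; it only asks the registered owner, which is the
only party that knows when letting go is safe.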
-- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tang Chen Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Thu, 16 May 2013 13:54:18 +0800 Message-ID: <5194748A.5070700@cn.fujitsu.com> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Cc: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Mel Gorman Return-path: In-Reply-To: <20130515132453.GB11497@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org Hi Mel, On 05/15/2013 09:24 PM, Mel Gorman wrote: > If it is to be an address space operations sturcture then you'll need a > pseudo mapping structure for anonymous pages that are pinned by aio -- > similar in principal to how swapper_space is used for managing PageSwapCache > or how anon_vma structures can be associated with a page. > > However, I warn you that you may find that the address_space is the > wrong level to register such callbacks, it just seemed like the obvious > first choice. 
A potential alternative implementation is to create a 1:1 > association between pages and a long-lived holder that is stored on a hash > table (similar style of arrangement as page_waitqueue). A page is looked up > in the hash table and if an entry exists, it points to an callback structure > to the subsystem holding the pin. It's up to the subsystem to register the > callbacks when it is about to pin a page (get_user_pages_longlived(...., > &release_ops) and figure out how to release the pin safely. > OK, I'll try to figure out a proper place to put the callbacks. But I think we need to add something new to struct page. I'm just not sure if it is OK. Maybe we can discuss more about it when I send a RFC patch. Thanks for the advices, and I'll try them. Thanks. :) From mboxrd@z Thu Jan 1 00:00:00 1970 From: Benjamin LaHaise Subject: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Date: Thu, 16 May 2013 20:23:49 -0400 Message-ID: <20130517002349.GI1008@kvack.org> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek 
Szyprowski To: Tang Chen Return-path: Content-Disposition: inline In-Reply-To: <5194748A.5070700@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Thu, May 16, 2013 at 01:54:18PM +0800, Tang Chen wrote: ... > OK, I'll try to figure out a proper place to put the callbacks. > But I think we need to add something new to struct page. I'm just > not sure if it is OK. Maybe we can discuss more about it when I send > a RFC patch. ... I ended up working on this a bit today, and managed to cobble together something that somewhat works -- please see the patch below. It still is not completely tested, and it has a rather nasty bug owing to the fact that the file descriptors returned by anon_inode_getfile() all share the same inode (read: more than one instance of aio does not work), but it shows the basic idea. Also, bad things probably happen if someone does an mremap() on the aio ring buffer. I'll polish this off sometime next week after the long weekend if noone beats me to it. -ben -- "Thought is the essence of where you are now." 
fs/aio.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++-- include/linux/migrate.h | 3 + mm/migrate.c | 2 - 3 files changed, 96 insertions(+), 4 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index c5b1a8c..dbad23e 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -35,6 +35,9 @@ #include #include #include +#include +#include +#include #include #include @@ -108,6 +111,7 @@ struct kioctx { } ____cacheline_aligned_in_smp; struct page *internal_pages[AIO_RING_PAGES]; + struct file *ctx_file; }; /*------ sysctl variables----*/ @@ -146,8 +150,59 @@ static void aio_free_ring(struct kioctx *ctx) if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) kfree(ctx->ring_pages); + + if (ctx->ctx_file) { + truncate_setsize(ctx->ctx_file->f_inode, 0); + fput(ctx->ctx_file); + ctx->ctx_file = NULL; + } +} + +static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma) +{ + vma->vm_ops = &generic_file_vm_ops; + return 0; +} + +static const struct file_operations aio_ctx_fops = { + .mmap = aio_ctx_mmap, +}; + +static int aio_set_page_dirty(struct page *page) +{ + return 0; +} + +static int aio_migratepage(struct address_space *mapping, struct page *new, + struct page *old, enum migrate_mode mode) +{ + struct kioctx *ctx = mapping->private_data; + unsigned long flags; + unsigned idx = old->index; + int rc; + + BUG_ON(PageWriteback(old)); /* Writeback must be complete */ + put_page(old); + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode); + if (rc != MIGRATEPAGE_SUCCESS) { + get_page(old); + return rc; + } + get_page(new); + + spin_lock_irqsave(&ctx->completion_lock, flags); + migrate_page_copy(new, old); + ctx->ring_pages[idx] = new; + spin_unlock_irqrestore(&ctx->completion_lock, flags); + + return MIGRATEPAGE_SUCCESS; } +static const struct address_space_operations aio_ctx_aops = { + .set_page_dirty = aio_set_page_dirty, + .migratepage = aio_migratepage, +}; + static int aio_setup_ring(struct kioctx *ctx) { struct aio_ring *ring; @@ -155,6 +210,7 @@ static 
int aio_setup_ring(struct kioctx *ctx) struct mm_struct *mm = current->mm; unsigned long size, populate; int nr_pages; + int i; /* Compensate for the ring buffer's head/tail overlap entry */ nr_events += 2; /* 1 is required, 2 for good luck */ @@ -166,6 +222,31 @@ static int aio_setup_ring(struct kioctx *ctx) if (nr_pages < 0) return -EINVAL; + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR); + if (IS_ERR(ctx->ctx_file)) { + ctx->ctx_file = NULL; + return -EAGAIN; + } + ctx->ctx_file->f_inode->i_mapping->a_ops = &aio_ctx_aops; + ctx->ctx_file->f_inode->i_mapping->private_data = ctx; + ctx->ctx_file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages; + + for (i=0; i<nr_pages; i++) { + struct page *page; + void *ptr; + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping, + i, GFP_KERNEL); + if (!page) { + break; + } + ptr = kmap(page); + clear_page(ptr); + kunmap(page); + SetPageUptodate(page); + SetPageDirty(page); + unlock_page(page); + } + nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event); ctx->nr_events = 0; @@ -180,20 +261,25 @@ static int aio_setup_ring(struct kioctx *ctx) ctx->mmap_size = nr_pages * PAGE_SIZE; pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size); down_write(&mm->mmap_sem); - ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size, + ctx->mmap_base = do_mmap_pgoff(ctx->ctx_file, 0, ctx->mmap_size, PROT_READ|PROT_WRITE, - MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate); + MAP_SHARED|MAP_POPULATE, 0, + &populate); if (IS_ERR((void *)ctx->mmap_base)) { up_write(&mm->mmap_sem); ctx->mmap_size = 0; aio_free_ring(ctx); return -EAGAIN; } + up_write(&mm->mmap_sem); + mm_populate(ctx->mmap_base, populate); pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages, 1, 0, ctx->ring_pages, NULL); - up_write(&mm->mmap_sem); + for (i=0; i<ctx->nr_pages; i++) { + put_page(ctx->ring_pages[i]); + } if (unlikely(ctx->nr_pages != nr_pages)) { aio_free_ring(ctx); @@ -403,6 +489,8 @@ out_cleanup: err = -EAGAIN;
aio_free_ring(ctx); out_freectx: + if (ctx->ctx_file) + fput(ctx->ctx_file); kmem_cache_free(kioctx_cachep, ctx); pr_debug("error allocating ioctx %d\n", err); return ERR_PTR(err); @@ -852,6 +940,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) ioctx = ioctx_alloc(nr_events); ret = PTR_ERR(ioctx); if (!IS_ERR(ioctx)) { + ctx = ioctx->user_id; ret = put_user(ioctx->user_id, ctxp); if (ret) kill_ioctx(ioctx); diff --git a/include/linux/migrate.h b/include/linux/migrate.h index a405d3dc..b6f3289 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -55,6 +55,9 @@ extern int migrate_vmas(struct mm_struct *mm, extern void migrate_page_copy(struct page *newpage, struct page *page); extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); +extern int migrate_page_move_mapping(struct address_space *mapping, + struct page *newpage, struct page *page, + struct buffer_head *head, enum migrate_mode mode); #else static inline void putback_lru_pages(struct list_head *l) {} diff --git a/mm/migrate.c b/mm/migrate.c index 27ed225..ac9c3a9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -294,7 +294,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head, * 2 for pages with a mapping * 3 for pages with a mapping and PagePrivate/PagePrivate2 set. */ -static int migrate_page_move_mapping(struct address_space *mapping, +int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, struct buffer_head *head, enum migrate_mode mode) { -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tang Chen Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Date: Fri, 17 May 2013 11:28:52 +0800 Message-ID: <5195A3F4.70803@cn.fujitsu.com> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Benjamin LaHaise Return-path: In-Reply-To: <20130517002349.GI1008@kvack.org> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Benjamin, Thank you very much for your idea. :) I have no objection to your idea, but seeing from your patch, this only works for aio subsystem because you changed the way to allocate the aio ring pages, with a file mapping. So far as I know, not only aio, but also other subsystems, such CMA, will also have problem like this. The page cannot be migrated because it is pinned in memory. 
So I think we should work out a common way to solve how to migrate pinned pages. I'm working in the way Mel has said, migrate_unpin() and migrate_pin() callbacks. But as you saw, I met some problems, like I don't where to put these two callbacks. And discussed with you guys, I want to try this: 1. Add a new member to struct page, used to remember the pin holders of this page, including the pin and unpin callbacks and the necessary data. This is more like a callback chain. (I'm worry about this step, I'm not sure if it is good enough. After all, we need a good place to put the callbacks.) And then, like Mel said, 2. Implement the callbacks in the subsystems, and register them to the new member in struct page. 3. Call these callbacks before and after migration. I think I'll send a RFC patch next week when I finished the outline. I'm just thinking of finding a common way to solve this problem that all the other subsystems will benefit. Thanks. :) On 05/17/2013 08:23 AM, Benjamin LaHaise wrote: > On Thu, May 16, 2013 at 01:54:18PM +0800, Tang Chen wrote: > ... >> OK, I'll try to figure out a proper place to put the callbacks. >> But I think we need to add something new to struct page. I'm just >> not sure if it is OK. Maybe we can discuss more about it when I send >> a RFC patch. > ... > > I ended up working on this a bit today, and managed to cobble together > something that somewhat works -- please see the patch below. It still is > not completely tested, and it has a rather nasty bug owing to the fact > that the file descriptors returned by anon_inode_getfile() all share the > same inode (read: more than one instance of aio does not work), but it > shows the basic idea. Also, bad things probably happen if someone does > an mremap() on the aio ring buffer. I'll polish this off sometime next > week after the long weekend if noone beats me to it. > > -ben -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Benjamin LaHaise Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Date: Fri, 17 May 2013 10:37:18 -0400 Message-ID: <20130517143718.GK1008@kvack.org> References: <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Tang Chen Return-path: Content-Disposition: inline In-Reply-To: <5195A3F4.70803@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, May 17, 2013 at 11:28:52AM +0800, Tang Chen wrote: > Hi Benjamin, > > Thank you very much for your idea. :) > > I have no objection to your idea, but seeing from your patch, this only > works for aio subsystem because you changed the way to allocate the aio > ring pages, with a file mapping. That is correct. There is no way you're going to be able to solve this problem without dealing with the issue on a subsystem by subsystem basis. > So far as I know, not only aio, but also other subsystems, such CMA, will > also have problem like this. 
The page cannot be migrated because it is > pinned in memory. So I think we should work out a common way to solve how > to migrate pinned pages. A generic approach would require hardware support, but I doubt that is going to happen. > I'm working in the way Mel has said, migrate_unpin() and migrate_pin() > callbacks. But as you saw, I met some problems, like I don't where to put > these two callbacks. And discussed with you guys, I want to try this: > > 1. Add a new member to struct page, used to remember the pin holders of > this page, including the pin and unpin callbacks and the necessary data. > This is more like a callback chain. > (I'm worry about this step, I'm not sure if it is good enough. After > all, > we need a good place to put the callbacks.) Putting function pointers into struct page is not going to happen. You'd be adding a significant amount of memory overhead for something that is never going to be used on the vast majority of systems (2 function pointers would be 16 bytes per page on a 64 bit system). Keep in mind that distro kernels tend to enable almost all config options on their kernels, so the overhead of any approach has to make sense for the users of the kernel that will never make use of this kind of migration. > And then, like Mel said, > > 2. Implement the callbacks in the subsystems, and register them to the > new member in struct page. No, the hook should be in the address_space_operations. We already have a pointer to an address space in struct page. This avoids adding more overhead to struct page. > 3. Call these callbacks before and after migration. How is that better than using the existing hook in address_space_operations? > I think I'll send a RFC patch next week when I finished the outline. I'm > just thinking of finding a common way to solve this problem that all the > other subsystems will benefit. Before pursuing this approach, make sure you've got buy-in for all of the overhead you're adding to the system. 
I don't think that growing struct page is going to be an acceptable design choice given the amount of overhead it will incur. > Thanks. :) Cheers, -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Zach Brown Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Date: Fri, 17 May 2013 11:17:08 -0700 Message-ID: <20130517181708.GG318@lenny.home.zabbo.net> References: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Tang Chen , Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Benjamin LaHaise Return-path: Content-Disposition: inline In-Reply-To: <20130517002349.GI1008@kvack.org> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org > I ended up working on this a bit today, and managed to cobble together > something that somewhat works -- please see the patch below. 
Just some quick observations: > + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR); > + if (IS_ERR(ctx->ctx_file)) { > + ctx->ctx_file = NULL; > + return -EAGAIN; > + } It's too bad that aio contexts will now be accounted against the filp limits (get_empty_filp -> files_stat.max_files, etc). > + for (i=0; i<nr_pages; i++) { > + struct page *page; > + void *ptr; > + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping, > + i, GFP_KERNEL); > + if (!page) { > + break; > + } > + ptr = kmap(page); > + clear_page(ptr); > + kunmap(page); > + SetPageUptodate(page); > + SetPageDirty(page); > + unlock_page(page); > + } If they're GFP_KERNEL then you don't need to kmap them. But we probably want to allocate with GFP_HIGHUSER and then use clear_user_highpage() to zero them? - z -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Benjamin LaHaise Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Date: Fri, 17 May 2013 14:30:03 -0400 Message-ID: <20130517183003.GL1008@kvack.org> References: <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <20130517181708.GG318@lenny.home.zabbo.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Tang Chen , Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com,
laijs@cn.fujitsu.com, jiang.liu@huawei.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Zach Brown Return-path: Content-Disposition: inline In-Reply-To: <20130517181708.GG318@lenny.home.zabbo.net> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, May 17, 2013 at 11:17:08AM -0700, Zach Brown wrote: > > I ended up working on this a bit today, and managed to cobble together > > something that somewhat works -- please see the patch below. > > Just some quick observations: > > > + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR); > > + if (IS_ERR(ctx->ctx_file)) { > > + ctx->ctx_file = NULL; > > + return -EAGAIN; > > + } > > It's too bad that aio contexts will now be accounted against the filp > limits (get_empty_filp -> files_stat.max_files, etc). Yeah, that is a downside of this approach. It would be possible to do it with only an inode/address_space, but that would mean bypassing do_mmap(), which is not worth considering. If it is really an issue, we could add a flag to bypass that limit since aio has its own. anon_inode_getfile() as it stands is a major problem. > > + for (i=0; i<nr_pages; i++) { > > + struct page *page; > > + void *ptr; > > + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping, > > + i, GFP_KERNEL); > > + if (!page) { > > + break; > > + } > > + ptr = kmap(page); > > + clear_page(ptr); > > + kunmap(page); > > + SetPageUptodate(page); > > + SetPageDirty(page); > > + unlock_page(page); > > + } > > If they're GFP_KERNEL then you don't need to kmap them. But we probably > want to allocate with GFP_HIGHUSER and then use clear_user_highpage() to > zero them? Adding __GFP_ZERO would fix that too. The next respin will include that change. I also have to properly handle the mremap() case as well. -ben -- "Thought is the essence of where you are now."
-- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tang Chen Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Date: Tue, 21 May 2013 10:07:52 +0800 Message-ID: <519AD6F8.2070504@cn.fujitsu.com> References: <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Benjamin LaHaise Return-path: In-Reply-To: <20130517143718.GK1008@kvack.org> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Benjamin, Sorry for the late. Please see below. On 05/17/2013 10:37 PM, Benjamin LaHaise wrote: > On Fri, May 17, 2013 at 11:28:52AM +0800, Tang Chen wrote: >> Hi Benjamin, >> >> Thank you very much for your idea. 
:) >> >> I have no objection to your idea, but seeing from your patch, this only >> works for aio subsystem because you changed the way to allocate the aio >> ring pages, with a file mapping. > > That is correct. There is no way you're going to be able to solve this > problem without dealing with the issue on a subsystem by subsystem basis. > Yes, I understand that. We need subsystem work anyway. >> I'm working in the way Mel has said, migrate_unpin() and migrate_pin() >> callbacks. But as you saw, I met some problems, like I don't where to put >> these two callbacks. And discussed with you guys, I want to try this: >> >> 1. Add a new member to struct page, used to remember the pin holders of >> this page, including the pin and unpin callbacks and the necessary data. >> This is more like a callback chain. >> (I'm worry about this step, I'm not sure if it is good enough. After >> all, >> we need a good place to put the callbacks.) > > Putting function pointers into struct page is not going to happen. You'd > be adding a significant amount of memory overhead for something that is > never going to be used on the vast majority of systems (2 function pointers > would be 16 bytes per page on a 64 bit system). Keep in mind that distro > kernels tend to enable almost all config options on their kernels, so the > overhead of any approach has to make sense for the users of the kernel that > will never make use of this kind of migration. True. But I just cannot find a place to hold the callbacks. > >> 3. Call these callbacks before and after migration. > > How is that better than using the existing hook in address_space_operations? I'm not saying using two callbacks before and after migration is better. I don't want to use address_space_operations is because there is no such member for anonymous pages. In your idea, using a file mapping will create a address_space_operations. 
But I really don't think we can modify the way of memory allocation for all the subsystems who has this problem. Maybe not just aio and cma. That means if you want to pin pages in memory, you have to use a file mapping. This makes the memory allocation more complicated. And the idea should be known by all the subsystem developers. Is that going to happen ? I also thought about reuse one field of struct page. But as you said, there may not be many users of this functionality. Reusing a field of struct page will make things more complicated and lead to high coupling. So, how about the other idea that Mel mentioned ? We create a 1-1 mapping of pinned page ranges and the pinner (subsystem callbacks and data), maybe a global list or a hash table. And then, we can find the callbacks. Thanks. :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Benjamin LaHaise Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Date: Mon, 20 May 2013 22:27:33 -0400 Message-ID: <20130521022733.GT1008@kvack.org> References: <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, 
jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Tang Chen Return-path: Content-Disposition: inline In-Reply-To: <519AD6F8.2070504@cn.fujitsu.com> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Tue, May 21, 2013 at 10:07:52AM +0800, Tang Chen wrote: .... > I'm not saying using two callbacks before and after migration is better. > I don't want to use address_space_operations is because there is no such > member > for anonymous pages. That depends on the nature of the pinning. For the general case of get_user_pages(), you're correct that it won't work for anonymous memory. > In your idea, using a file mapping will create a > address_space_operations. But > I really don't think we can modify the way of memory allocation for all the > subsystems who has this problem. Maybe not just aio and cma. That means if > you want to pin pages in memory, you have to use a file mapping. This makes > the memory allocation more complicated. And the idea should be known by all > the subsystem developers. Is that going to happen ? Different subsystems will need to use different approaches to fixing the issue. I doubt any single approach will work for everything. > I also thought about reuse one field of struct page. But as you said, there > may not be many users of this functionality. Reusing a field of struct page > will make things more complicated and lead to high coupling. What happens when more than one subsystem tries to pin a particular page? What if it's a shared page rather than an anonymous page? > So, how about the other idea that Mel mentioned ? > > We create a 1-1 mapping of pinned page ranges and the pinner (subsystem > callbacks and data), maybe a global list or a hash table. And then, we can > find the callbacks. 
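For illustration only, the 1-1 mapping quoted above (pinned page to pinner callbacks and data, found through a hash table) might look roughly like the userspace C sketch below. Every name in it (pin_ops, pin_holder, pin_register(), pin_lookup(), pin_release_for_migration(), demo_unpin()) is hypothetical and exists in no kernel tree; the sketch also assumes a single holder per page and omits all locking, both of which a real implementation would have to address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callbacks a pinning subsystem would supply. */
struct pin_ops {
	int (*migrate_unpin)(void *data);	/* drop the pin before migration */
	int (*migrate_pin)(void *data);		/* re-take the pin afterwards */
};

/* One entry per pinned page: the 1:1 page -> holder association. */
struct pin_holder {
	uintptr_t pfn;			/* page frame number, used as the key */
	const struct pin_ops *ops;
	void *data;			/* opaque per-subsystem data */
	struct pin_holder *next;	/* hash chain */
};

#define PIN_HASH_BITS 8
#define PIN_HASH_SIZE (1u << PIN_HASH_BITS)

static struct pin_holder *pin_hash[PIN_HASH_SIZE];

unsigned int pin_hashfn(uintptr_t pfn)
{
	/* Multiplicative hash; the constant is arbitrary for this sketch. */
	return (unsigned int)((pfn * 2654435761u) >> 7) & (PIN_HASH_SIZE - 1);
}

/* Called by a subsystem that is about to pin a page long-term. */
void pin_register(struct pin_holder *h, uintptr_t pfn,
		  const struct pin_ops *ops, void *data)
{
	unsigned int b = pin_hashfn(pfn);

	assert(ops != NULL);
	h->pfn = pfn;
	h->ops = ops;
	h->data = data;
	h->next = pin_hash[b];
	pin_hash[b] = h;
}

/* Migration side: find the holder of a page, or NULL if nobody pins it. */
struct pin_holder *pin_lookup(uintptr_t pfn)
{
	struct pin_holder *h;

	for (h = pin_hash[pin_hashfn(pfn)]; h; h = h->next)
		if (h->pfn == pfn)
			return h;
	return NULL;
}

/* Ask the holder to release its pin; returns -1 if the page has no holder. */
int pin_release_for_migration(uintptr_t pfn)
{
	struct pin_holder *h = pin_lookup(pfn);

	return h ? h->ops->migrate_unpin(h->data) : -1;
}

/* Example callback for the sketch: counts how often it was invoked. */
int demo_unpin(void *data)
{
	++*(int *)data;
	return 0;
}
```

Even in this toy form, every pin costs a registration and every migration attempt costs a lookup, which is the extra work on the pinning path that the reply below is concerned about.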
Maybe that is the simplest approach, but it's going to make get_user_pages() slower and more complicated (as if it wasn't already). Maybe with all the bells and whistles of per-cpu data structures and such you can make it work, but I'm pretty sure someone running the large unmentionable benchmark will complain about the performance regressions you're going to introduce. At least in the case of the AIO ring buffer, using the address_space approach doesn't introduce any new performance issues. There's also the bigger question of if you can or cannot exclude get_user_pages_fast() from this. In short: you've got a lot more work on your hands to do. > Thanks. :) Cheers, -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tang Chen Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Date: Tue, 11 Jun 2013 17:42:31 +0800 Message-ID: <51B6F107.80501@cn.fujitsu.com> References: <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, 
jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Benjamin LaHaise Return-path: In-Reply-To: <20130521022733.GT1008@kvack.org> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Benjamin, Are you still working on this problem ? Thanks. :) On 05/21/2013 10:27 AM, Benjamin LaHaise wrote: > On Tue, May 21, 2013 at 10:07:52AM +0800, Tang Chen wrote: > .... >> I'm not saying using two callbacks before and after migration is better. >> I don't want to use address_space_operations is because there is no such >> member >> for anonymous pages. > > That depends on the nature of the pinning. For the general case of > get_user_pages(), you're correct that it won't work for anonymous memory. > >> In your idea, using a file mapping will create a >> address_space_operations. But >> I really don't think we can modify the way of memory allocation for all the >> subsystems who has this problem. Maybe not just aio and cma. That means if >> you want to pin pages in memory, you have to use a file mapping. This makes >> the memory allocation more complicated. And the idea should be known by all >> the subsystem developers. Is that going to happen ? > > Different subsystems will need to use different approaches to fixing the > issue. I doubt any single approach will work for everything. > >> I also thought about reuse one field of struct page. But as you said, there >> may not be many users of this functionality. Reusing a field of struct page >> will make things more complicated and lead to high coupling. > > What happens when more than one subsystem tries to pin a particular page? > What if it's a shared page rather than an anonymous page? > >> So, how about the other idea that Mel mentioned ? >> >> We create a 1-1 mapping of pinned page ranges and the pinner (subsystem >> callbacks and data), maybe a global list or a hash table. 
And then, we can >> find the callbacks. > > Maybe that is the simplest approach, but it's going to make get_user_pages() > slower and more complicated (as if it wasn't already). Maybe with all the > bells and whistles of per-cpu data structures and such you can make it work, > but I'm pretty sure someone running the large unmentionable benchmark will > complain about the performance regressions you're going to introduce. At > least in the case of the AIO ring buffer, using the address_space approach > doesn't introduce any new performance issues. There's also the bigger > question of if you can or cannot exclude get_user_pages_fast() from this. > In short: you've got a lot more work on your hands to do. > >> Thanks. :) > > Cheers, > > -ben -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Benjamin LaHaise Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Date: Tue, 11 Jun 2013 10:45:25 -0400 Message-ID: <20130611144525.GB14404@kvack.org> References: <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, 
linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Tang Chen Return-path: Content-Disposition: inline In-Reply-To: <51B6F107.80501@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Tang, On Tue, Jun 11, 2013 at 05:42:31PM +0800, Tang Chen wrote: > Hi Benjamin, > > Are you still working on this problem ? > > Thanks. :) Below is a copy of the most recent version of this patch I have worked on. This version works and stands up to my testing using move_pages() to force the migration of the aio ring buffer. A test program is available at http://www.kvack.org/~bcrl/aio/aio-numa-test.c . Please note that this version is not suitable for mainline as the modifications to the anon inode code are undesirable, so that part needs reworking. -ben fs/aio.c | 113 ++++++++++++++++++++++++++++++++++++++++++++---- fs/anon_inodes.c | 14 ++++- include/linux/migrate.h | 3 + mm/migrate.c | 2 mm/swap.c | 1 5 files changed, 121 insertions(+), 12 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index c5b1a8c..a951690 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -35,6 +35,9 @@ #include #include #include +#include +#include +#include #include #include @@ -108,6 +111,7 @@ struct kioctx { } ____cacheline_aligned_in_smp; struct page *internal_pages[AIO_RING_PAGES]; + struct file *ctx_file; }; /*------ sysctl variables----*/ @@ -136,18 +140,80 @@ __initcall(aio_setup); static void aio_free_ring(struct kioctx *ctx) { - long i; - - for (i = 0; i < ctx->nr_pages; i++) - put_page(ctx->ring_pages[i]); + int i; if (ctx->mmap_size) vm_munmap(ctx->mmap_base, ctx->mmap_size); + if (ctx->ctx_file) + truncate_setsize(ctx->ctx_file->f_inode, 0); + + for (i = 0; i < ctx->nr_pages; i++) { + pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, + page_count(ctx->ring_pages[i])); + put_page(ctx->ring_pages[i]); + } + if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) kfree(ctx->ring_pages); +
+ if (ctx->ctx_file) { + truncate_setsize(ctx->ctx_file->f_inode, 0); + pr_debug("pid(%d) i_nlink=%u d_count=%d, d_unhashed=%d i_count=%d\n", + current->pid, ctx->ctx_file->f_inode->i_nlink, + ctx->ctx_file->f_path.dentry->d_count, + d_unhashed(ctx->ctx_file->f_path.dentry), + atomic_read(&ctx->ctx_file->f_path.dentry->d_inode->i_count)); + fput(ctx->ctx_file); + ctx->ctx_file = NULL; + } +} + +static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma) +{ + vma->vm_ops = &generic_file_vm_ops; + return 0; +} + +static const struct file_operations aio_ctx_fops = { + .mmap = aio_ctx_mmap, +}; + +static int aio_set_page_dirty(struct page *page) +{ + return 0; +} + +static int aio_migratepage(struct address_space *mapping, struct page *new, + struct page *old, enum migrate_mode mode) +{ + struct kioctx *ctx = mapping->private_data; + unsigned long flags; + unsigned idx = old->index; + int rc; + + BUG_ON(PageWriteback(old)); /* Writeback must be complete */ + put_page(old); + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode); + if (rc != MIGRATEPAGE_SUCCESS) { + get_page(old); + return rc; + } + get_page(new); + + spin_lock_irqsave(&ctx->completion_lock, flags); + migrate_page_copy(new, old); + ctx->ring_pages[idx] = new; + spin_unlock_irqrestore(&ctx->completion_lock, flags); + + return MIGRATEPAGE_SUCCESS; } +static const struct address_space_operations aio_ctx_aops = { + .set_page_dirty = aio_set_page_dirty, + .migratepage = aio_migratepage, +}; + static int aio_setup_ring(struct kioctx *ctx) { struct aio_ring *ring; @@ -155,6 +221,7 @@ static int aio_setup_ring(struct kioctx *ctx) struct mm_struct *mm = current->mm; unsigned long size, populate; int nr_pages; + int i; /* Compensate for the ring buffer's head/tail overlap entry */ nr_events += 2; /* 1 is required, 2 for good luck */ @@ -166,6 +233,28 @@ static int aio_setup_ring(struct kioctx *ctx) if (nr_pages < 0) return -EINVAL; + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, 
ctx, O_RDWR); + if (IS_ERR(ctx->ctx_file)) { + ctx->ctx_file = NULL; + return -EAGAIN; + } + ctx->ctx_file->f_inode->i_mapping->a_ops = &aio_ctx_aops; + ctx->ctx_file->f_inode->i_mapping->private_data = ctx; + ctx->ctx_file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages; + + for (i=0; i<nr_pages; i++) { + struct page *page; + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping, + i, GFP_HIGHUSER | __GFP_ZERO); + if (!page) + break; + pr_debug("pid(%d) page[%d]->count=%d\n", + current->pid, i, page_count(page)); + SetPageUptodate(page); + SetPageDirty(page); + unlock_page(page); + } + nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event); ctx->nr_events = 0; @@ -180,20 +269,25 @@ static int aio_setup_ring(struct kioctx *ctx) ctx->mmap_size = nr_pages * PAGE_SIZE; pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size); down_write(&mm->mmap_sem); - ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size, - PROT_READ|PROT_WRITE, - MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate); + ctx->mmap_base = do_mmap_pgoff(ctx->ctx_file, 0, ctx->mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, 0, + &populate); if (IS_ERR((void *)ctx->mmap_base)) { up_write(&mm->mmap_sem); ctx->mmap_size = 0; aio_free_ring(ctx); return -EAGAIN; } + up_write(&mm->mmap_sem); + mm_populate(ctx->mmap_base, populate); pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages, 1, 0, ctx->ring_pages, NULL); - up_write(&mm->mmap_sem); + for (i=0; i<ctx->nr_pages; i++) { + put_page(ctx->ring_pages[i]); + } if (unlikely(ctx->nr_pages != nr_pages)) { aio_free_ring(ctx); @@ -403,6 +497,8 @@ out_cleanup: err = -EAGAIN; aio_free_ring(ctx); out_freectx: + if (ctx->ctx_file) + fput(ctx->ctx_file); kmem_cache_free(kioctx_cachep, ctx); pr_debug("error allocating ioctx %d\n", err); return ERR_PTR(err); @@ -852,6 +948,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) ioctx = ioctx_alloc(nr_events); ret = PTR_ERR(ioctx); if (!IS_ERR(ioctx)) { +
ctx = ioctx->user_id; ret = put_user(ioctx->user_id, ctxp); if (ret) kill_ioctx(ioctx); diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 47a65df..376d289 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -131,6 +131,7 @@ struct file *anon_inode_getfile(const char *name, struct qstr this; struct path path; struct file *file; + struct inode *inode; if (IS_ERR(anon_inode_inode)) return ERR_PTR(-ENODEV); @@ -138,6 +139,12 @@ struct file *anon_inode_getfile(const char *name, if (fops->owner && !try_module_get(fops->owner)) return ERR_PTR(-ENOENT); + inode = anon_inode_mkinode(anon_inode_inode->i_sb); + if (IS_ERR(inode)) { + file = ERR_PTR(-ENOMEM); + goto err_module; + } + /* * Link the inode to a directory entry by creating a unique name * using the inode sequence number. @@ -155,17 +162,18 @@ struct file *anon_inode_getfile(const char *name, * We know the anon_inode inode count is always greater than zero, * so ihold() is safe. */ - ihold(anon_inode_inode); + //ihold(inode); - d_instantiate(path.dentry, anon_inode_inode); + d_instantiate(path.dentry, inode); file = alloc_file(&path, OPEN_FMODE(flags), fops); if (IS_ERR(file)) goto err_dput; - file->f_mapping = anon_inode_inode->i_mapping; + file->f_mapping = inode->i_mapping; file->f_flags = flags & (O_ACCMODE | O_NONBLOCK); file->private_data = priv; + drop_nlink(inode); return file; diff --git a/include/linux/migrate.h b/include/linux/migrate.h index a405d3dc..b6f3289 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -55,6 +55,9 @@ extern int migrate_vmas(struct mm_struct *mm, extern void migrate_page_copy(struct page *newpage, struct page *page); extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); +extern int migrate_page_move_mapping(struct address_space *mapping, + struct page *newpage, struct page *page, + struct buffer_head *head, enum migrate_mode mode); #else static inline void putback_lru_pages(struct list_head 
*l) {} diff --git a/mm/migrate.c b/mm/migrate.c index 27ed225..ac9c3a9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -294,7 +294,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head, * 2 for pages with a mapping * 3 for pages with a mapping and PagePrivate/PagePrivate2 set. */ -static int migrate_page_move_mapping(struct address_space *mapping, +int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, struct buffer_head *head, enum migrate_mode mode) { diff --git a/mm/swap.c b/mm/swap.c index dfd7d71..bbfba0a 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -160,6 +160,7 @@ skip_lock_tail: void put_page(struct page *page) { + BUG_ON(page_count(page) <= 0); if (unlikely(PageCompound(page))) put_compound_page(page); else if (put_page_testzero(page)) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Gu Zheng Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Date: Fri, 28 Jun 2013 17:24:25 +0800 Message-ID: <51CD5649.8040408@cn.fujitsu.com> References: <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Tang Chen , Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, 
rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Benjamin LaHaise Return-path: In-Reply-To: <20130611144525.GB14404@kvack.org> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 06/11/2013 10:45 PM, Benjamin LaHaise wrote: > Hi Tang, > > On Tue, Jun 11, 2013 at 05:42:31PM +0800, Tang Chen wrote: >> Hi Benjamin, >> >> Are you still working on this problem ? >> >> Thanks. :) > > Below is a copy of the most recent version of this patch I have worked > on. This version works and stands up to my testing using move_pages() to > force the migration of the aio ring buffer. A test program is available > at http://www.kvack.org/~bcrl/aio/aio-numa-test.c . Please note that > this version is not suitable for mainline as the modifications to the > anon inode code are undesirable, so that part needs reworking. Hi Ben, Are you still working on this patch? As you know, with the current anon inode, more than one instance of aio cannot work. Have you found a way to fix this issue? Or can we use something else in place of the anon inode?
Thanks, Gu > > -ben > > > fs/aio.c | 113 ++++++++++++++++++++++++++++++++++++++++++++---- > fs/anon_inodes.c | 14 ++++- > include/linux/migrate.h | 3 + > mm/migrate.c | 2 > mm/swap.c | 1 > 5 files changed, 121 insertions(+), 12 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index c5b1a8c..a951690 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -35,6 +35,9 @@ > #include > #include > #include > +#include > +#include > +#include > > #include > #include > @@ -108,6 +111,7 @@ struct kioctx { > } ____cacheline_aligned_in_smp; > > struct page *internal_pages[AIO_RING_PAGES]; > + struct file *ctx_file; > }; > > /*------ sysctl variables----*/ > @@ -136,18 +140,80 @@ __initcall(aio_setup); > > static void aio_free_ring(struct kioctx *ctx) > { > - long i; > - > - for (i = 0; i < ctx->nr_pages; i++) > - put_page(ctx->ring_pages[i]); > + int i; > > if (ctx->mmap_size) > vm_munmap(ctx->mmap_base, ctx->mmap_size); > > + if (ctx->ctx_file) > + truncate_setsize(ctx->ctx_file->f_inode, 0); > + > + for (i = 0; i < ctx->nr_pages; i++) { > + pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, > + page_count(ctx->ring_pages[i])); > + put_page(ctx->ring_pages[i]); > + } > + > if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) > kfree(ctx->ring_pages); > + > + if (ctx->ctx_file) { > + truncate_setsize(ctx->ctx_file->f_inode, 0); > + pr_debug("pid(%d) i_nlink=%u d_count=%d, d_unhashed=%d i_count=%d\n", > + current->pid, ctx->ctx_file->f_inode->i_nlink, > + ctx->ctx_file->f_path.dentry->d_count, > + d_unhashed(ctx->ctx_file->f_path.dentry), > + atomic_read(&ctx->ctx_file->f_path.dentry->d_inode->i_count)); > + fput(ctx->ctx_file); > + ctx->ctx_file = NULL; > + } > +} > + > +static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma) > +{ > + vma->vm_ops = &generic_file_vm_ops; > + return 0; > +} > + > +static const struct file_operations aio_ctx_fops = { > + .mmap = aio_ctx_mmap, > +}; > + > +static int aio_set_page_dirty(struct page *page) > +{ > + return 
0; > +} > + > +static int aio_migratepage(struct address_space *mapping, struct page *new, > + struct page *old, enum migrate_mode mode) > +{ > + struct kioctx *ctx = mapping->private_data; > + unsigned long flags; > + unsigned idx = old->index; > + int rc; > + > + BUG_ON(PageWriteback(old)); /* Writeback must be complete */ > + put_page(old); > + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode); > + if (rc != MIGRATEPAGE_SUCCESS) { > + get_page(old); > + return rc; > + } > + get_page(new); > + > + spin_lock_irqsave(&ctx->completion_lock, flags); > + migrate_page_copy(new, old); > + ctx->ring_pages[idx] = new; > + spin_unlock_irqrestore(&ctx->completion_lock, flags); > + > + return MIGRATEPAGE_SUCCESS; > } > > +static const struct address_space_operations aio_ctx_aops = { > + .set_page_dirty = aio_set_page_dirty, > + .migratepage = aio_migratepage, > +}; > + > static int aio_setup_ring(struct kioctx *ctx) > { > struct aio_ring *ring; > @@ -155,6 +221,7 @@ static int aio_setup_ring(struct kioctx *ctx) > struct mm_struct *mm = current->mm; > unsigned long size, populate; > int nr_pages; > + int i; > > /* Compensate for the ring buffer's head/tail overlap entry */ > nr_events += 2; /* 1 is required, 2 for good luck */ > @@ -166,6 +233,28 @@ static int aio_setup_ring(struct kioctx *ctx) > if (nr_pages < 0) > return -EINVAL; > > + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR); > + if (IS_ERR(ctx->ctx_file)) { > + ctx->ctx_file = NULL; > + return -EAGAIN; > + } > + ctx->ctx_file->f_inode->i_mapping->a_ops = &aio_ctx_aops; > + ctx->ctx_file->f_inode->i_mapping->private_data = ctx; > + ctx->ctx_file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages; > + > + for (i=0; i<nr_pages; i++) { > + struct page *page; > + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping, > + i, GFP_HIGHUSER | __GFP_ZERO); > + if (!page) > + break; > + pr_debug("pid(%d) page[%d]->count=%d\n", > + current->pid, i, page_count(page)); > + SetPageUptodate(page); > +
SetPageDirty(page); > + unlock_page(page); > + } > + > nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event); > > ctx->nr_events = 0; > @@ -180,20 +269,25 @@ static int aio_setup_ring(struct kioctx *ctx) > ctx->mmap_size = nr_pages * PAGE_SIZE; > pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size); > down_write(&mm->mmap_sem); > - ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size, > - PROT_READ|PROT_WRITE, > - MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate); > + ctx->mmap_base = do_mmap_pgoff(ctx->ctx_file, 0, ctx->mmap_size, > + PROT_READ | PROT_WRITE, > + MAP_SHARED | MAP_POPULATE, 0, > + &populate); > if (IS_ERR((void *)ctx->mmap_base)) { > up_write(&mm->mmap_sem); > ctx->mmap_size = 0; > aio_free_ring(ctx); > return -EAGAIN; > } > + up_write(&mm->mmap_sem); > + mm_populate(ctx->mmap_base, populate); > > pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); > ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages, > 1, 0, ctx->ring_pages, NULL); > - up_write(&mm->mmap_sem); > + for (i=0; i<ctx->nr_pages; i++) { > + put_page(ctx->ring_pages[i]); > + } > > if (unlikely(ctx->nr_pages != nr_pages)) { > aio_free_ring(ctx); > @@ -403,6 +497,8 @@ out_cleanup: > err = -EAGAIN; > aio_free_ring(ctx); > out_freectx: > + if (ctx->ctx_file) > + fput(ctx->ctx_file); > kmem_cache_free(kioctx_cachep, ctx); > pr_debug("error allocating ioctx %d\n", err); > return ERR_PTR(err); > @@ -852,6 +948,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) > ioctx = ioctx_alloc(nr_events); > ret = PTR_ERR(ioctx); > if (!IS_ERR(ioctx)) { > + ctx = ioctx->user_id; > ret = put_user(ioctx->user_id, ctxp); > if (ret) > kill_ioctx(ioctx); > diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c > index 47a65df..376d289 100644 > --- a/fs/anon_inodes.c > +++ b/fs/anon_inodes.c > @@ -131,6 +131,7 @@ struct file *anon_inode_getfile(const char *name, > struct qstr this; > struct path path; > struct file *file; > + struct inode
*inode; > > if (IS_ERR(anon_inode_inode)) > return ERR_PTR(-ENODEV); > @@ -138,6 +139,12 @@ struct file *anon_inode_getfile(const char *name, > if (fops->owner && !try_module_get(fops->owner)) > return ERR_PTR(-ENOENT); > > + inode = anon_inode_mkinode(anon_inode_inode->i_sb); > + if (IS_ERR(inode)) { > + file = ERR_PTR(-ENOMEM); > + goto err_module; > + } > + > /* > * Link the inode to a directory entry by creating a unique name > * using the inode sequence number. > @@ -155,17 +162,18 @@ struct file *anon_inode_getfile(const char *name, > * We know the anon_inode inode count is always greater than zero, > * so ihold() is safe. > */ > - ihold(anon_inode_inode); > + //ihold(inode); > > - d_instantiate(path.dentry, anon_inode_inode); > + d_instantiate(path.dentry, inode); > > file = alloc_file(&path, OPEN_FMODE(flags), fops); > if (IS_ERR(file)) > goto err_dput; > - file->f_mapping = anon_inode_inode->i_mapping; > + file->f_mapping = inode->i_mapping; > > file->f_flags = flags & (O_ACCMODE | O_NONBLOCK); > file->private_data = priv; > + drop_nlink(inode); > > return file; > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h > index a405d3dc..b6f3289 100644 > --- a/include/linux/migrate.h > +++ b/include/linux/migrate.h > @@ -55,6 +55,9 @@ extern int migrate_vmas(struct mm_struct *mm, > extern void migrate_page_copy(struct page *newpage, struct page *page); > extern int migrate_huge_page_move_mapping(struct address_space *mapping, > struct page *newpage, struct page *page); > +extern int migrate_page_move_mapping(struct address_space *mapping, > + struct page *newpage, struct page *page, > + struct buffer_head *head, enum migrate_mode mode); > #else > > static inline void putback_lru_pages(struct list_head *l) {} > diff --git a/mm/migrate.c b/mm/migrate.c > index 27ed225..ac9c3a9 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -294,7 +294,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head, > * 2 for pages with a mapping > 
* 3 for pages with a mapping and PagePrivate/PagePrivate2 set. > */ > -static int migrate_page_move_mapping(struct address_space *mapping, > +int migrate_page_move_mapping(struct address_space *mapping, > struct page *newpage, struct page *page, > struct buffer_head *head, enum migrate_mode mode) > { > diff --git a/mm/swap.c b/mm/swap.c > index dfd7d71..bbfba0a 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -160,6 +160,7 @@ skip_lock_tail: > > void put_page(struct page *page) > { > + BUG_ON(page_count(page) <= 0); > if (unlikely(PageCompound(page))) > put_compound_page(page); > else if (put_page_testzero(page)) > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Gu Zheng Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Date: Mon, 01 Jul 2013 15:23:39 +0800 Message-ID: <51D12E7B.6080301@cn.fujitsu.com> References: <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski To: Benjamin LaHaise Return-path: In-Reply-To: <20130611144525.GB14404@kvack.org> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 06/11/2013 10:45 PM, Benjamin LaHaise wrote: > Hi Tang, > > On Tue, Jun 11, 2013 at 05:42:31PM +0800, Tang Chen wrote: >> Hi Benjamin, >> >> Are you still working on this problem ? >> >> Thanks. :) > > Below is a copy of the most recent version of this patch I have worked > on. This version works and stands up to my testing using move_pages() to > force the migration of the aio ring buffer. A test program is available > at http://www.kvack.org/~bcrl/aio/aio-numa-test.c . 
Please note that > this version is not suitable for mainline as the modifications to the > anon inode code are undesirable, so that part needs reworking. Hi Ben, Are you still working on this patch? As you know, with the current anon inode, more than one instance of aio cannot work. Have you found a way to fix this issue? Or can we use something else in place of the anon inode? Thanks, Gu > > -ben > > > fs/aio.c | 113 ++++++++++++++++++++++++++++++++++++++++++++---- > fs/anon_inodes.c | 14 ++++- > include/linux/migrate.h | 3 + > mm/migrate.c | 2 > mm/swap.c | 1 > 5 files changed, 121 insertions(+), 12 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index c5b1a8c..a951690 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -35,6 +35,9 @@ > #include > #include > #include > +#include > +#include > +#include > > #include > #include > @@ -108,6 +111,7 @@ struct kioctx { > } ____cacheline_aligned_in_smp; > > struct page *internal_pages[AIO_RING_PAGES]; > + struct file *ctx_file; > }; > > /*------ sysctl variables----*/ > @@ -136,18 +140,80 @@ __initcall(aio_setup); > > static void aio_free_ring(struct kioctx *ctx) > { > - long i; > - > - for (i = 0; i < ctx->nr_pages; i++) > - put_page(ctx->ring_pages[i]); > + int i; > > if (ctx->mmap_size) > vm_munmap(ctx->mmap_base, ctx->mmap_size); > > + if (ctx->ctx_file) > + truncate_setsize(ctx->ctx_file->f_inode, 0); > + > + for (i = 0; i < ctx->nr_pages; i++) { > + pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, > + page_count(ctx->ring_pages[i])); > + put_page(ctx->ring_pages[i]); > + } > + > if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) > kfree(ctx->ring_pages); > + > + if (ctx->ctx_file) { > + truncate_setsize(ctx->ctx_file->f_inode, 0); > + pr_debug("pid(%d) i_nlink=%u d_count=%d, d_unhashed=%d i_count=%d\n", > + current->pid, ctx->ctx_file->f_inode->i_nlink, > + ctx->ctx_file->f_path.dentry->d_count, > + d_unhashed(ctx->ctx_file->f_path.dentry), > +
atomic_read(&ctx->ctx_file->f_path.dentry->d_inode->i_count)); > + fput(ctx->ctx_file); > + ctx->ctx_file = NULL; > + } > +} > + > +static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma) > +{ > + vma->vm_ops = &generic_file_vm_ops; > + return 0; > +} > + > +static const struct file_operations aio_ctx_fops = { > + .mmap = aio_ctx_mmap, > +}; > + > +static int aio_set_page_dirty(struct page *page) > +{ > + return 0; > +} > + > +static int aio_migratepage(struct address_space *mapping, struct page *new, > + struct page *old, enum migrate_mode mode) > +{ > + struct kioctx *ctx = mapping->private_data; > + unsigned long flags; > + unsigned idx = old->index; > + int rc; > + > + BUG_ON(PageWriteback(old)); /* Writeback must be complete */ > + put_page(old); > + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode); > + if (rc != MIGRATEPAGE_SUCCESS) { > + get_page(old); > + return rc; > + } > + get_page(new); > + > + spin_lock_irqsave(&ctx->completion_lock, flags); > + migrate_page_copy(new, old); > + ctx->ring_pages[idx] = new; > + spin_unlock_irqrestore(&ctx->completion_lock, flags); > + > + return MIGRATEPAGE_SUCCESS; > } > > +static const struct address_space_operations aio_ctx_aops = { > + .set_page_dirty = aio_set_page_dirty, > + .migratepage = aio_migratepage, > +}; > + > static int aio_setup_ring(struct kioctx *ctx) > { > struct aio_ring *ring; > @@ -155,6 +221,7 @@ static int aio_setup_ring(struct kioctx *ctx) > struct mm_struct *mm = current->mm; > unsigned long size, populate; > int nr_pages; > + int i; > > /* Compensate for the ring buffer's head/tail overlap entry */ > nr_events += 2; /* 1 is required, 2 for good luck */ > @@ -166,6 +233,28 @@ static int aio_setup_ring(struct kioctx *ctx) > if (nr_pages < 0) > return -EINVAL; > > + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR); > + if (IS_ERR(ctx->ctx_file)) { > + ctx->ctx_file = NULL; > + return -EAGAIN; > + } > + ctx->ctx_file->f_inode->i_mapping->a_ops = 
&aio_ctx_aops; > + ctx->ctx_file->f_inode->i_mapping->private_data = ctx; > + ctx->ctx_file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages; > + > + for (i=0; i<nr_pages; i++) { > + struct page *page; > + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping, > + i, GFP_HIGHUSER | __GFP_ZERO); > + if (!page) > + break; > + pr_debug("pid(%d) page[%d]->count=%d\n", > + current->pid, i, page_count(page)); > + SetPageUptodate(page); > + SetPageDirty(page); > + unlock_page(page); > + } > + > nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event); > > ctx->nr_events = 0; > @@ -180,20 +269,25 @@ static int aio_setup_ring(struct kioctx *ctx) > ctx->mmap_size = nr_pages * PAGE_SIZE; > pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size); > down_write(&mm->mmap_sem); > - ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size, > - PROT_READ|PROT_WRITE, > - MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate); > + ctx->mmap_base = do_mmap_pgoff(ctx->ctx_file, 0, ctx->mmap_size, > + PROT_READ | PROT_WRITE, > + MAP_SHARED | MAP_POPULATE, 0, > + &populate); > if (IS_ERR((void *)ctx->mmap_base)) { > up_write(&mm->mmap_sem); > ctx->mmap_size = 0; > aio_free_ring(ctx); > return -EAGAIN; > } > + up_write(&mm->mmap_sem); > + mm_populate(ctx->mmap_base, populate); > > pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); > ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages, > 1, 0, ctx->ring_pages, NULL); > - up_write(&mm->mmap_sem); > + for (i=0; i<ctx->nr_pages; i++) { > + put_page(ctx->ring_pages[i]); > + } > > if (unlikely(ctx->nr_pages != nr_pages)) { > aio_free_ring(ctx); > @@ -403,6 +497,8 @@ out_cleanup: > err = -EAGAIN; > aio_free_ring(ctx); > out_freectx: > + if (ctx->ctx_file) > + fput(ctx->ctx_file); > kmem_cache_free(kioctx_cachep, ctx); > pr_debug("error allocating ioctx %d\n", err); > return ERR_PTR(err); > @@ -852,6 +948,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) > ioctx = ioctx_alloc(nr_events); > ret =
PTR_ERR(ioctx);
> if (!IS_ERR(ioctx)) {
> + ctx = ioctx->user_id;
> ret = put_user(ioctx->user_id, ctxp);
> if (ret)
> kill_ioctx(ioctx);
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 47a65df..376d289 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -131,6 +131,7 @@ struct file *anon_inode_getfile(const char *name,
> struct qstr this;
> struct path path;
> struct file *file;
> + struct inode *inode;
>
> if (IS_ERR(anon_inode_inode))
> return ERR_PTR(-ENODEV);
> @@ -138,6 +139,12 @@ struct file *anon_inode_getfile(const char *name,
> if (fops->owner && !try_module_get(fops->owner))
> return ERR_PTR(-ENOENT);
>
> + inode = anon_inode_mkinode(anon_inode_inode->i_sb);
> + if (IS_ERR(inode)) {
> + file = ERR_PTR(-ENOMEM);
> + goto err_module;
> + }
> +
> /*
> * Link the inode to a directory entry by creating a unique name
> * using the inode sequence number.
> @@ -155,17 +162,18 @@ struct file *anon_inode_getfile(const char *name,
> * We know the anon_inode inode count is always greater than zero,
> * so ihold() is safe.
> */
> - ihold(anon_inode_inode);
> + //ihold(inode);
>
> - d_instantiate(path.dentry, anon_inode_inode);
> + d_instantiate(path.dentry, inode);
>
> file = alloc_file(&path, OPEN_FMODE(flags), fops);
> if (IS_ERR(file))
> goto err_dput;
> - file->f_mapping = anon_inode_inode->i_mapping;
> + file->f_mapping = inode->i_mapping;
>
> file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
> file->private_data = priv;
> + drop_nlink(inode);
>
> return file;
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index a405d3dc..b6f3289 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -55,6 +55,9 @@ extern int migrate_vmas(struct mm_struct *mm,
> extern void migrate_page_copy(struct page *newpage, struct page *page);
> extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> struct page *newpage, struct page *page);
> +extern int migrate_page_move_mapping(struct address_space *mapping,
> + struct page *newpage, struct page *page,
> + struct buffer_head *head, enum migrate_mode mode);
> #else
>
> static inline void putback_lru_pages(struct list_head *l) {}
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 27ed225..ac9c3a9 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -294,7 +294,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
> * 2 for pages with a mapping
> * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
> */
> -static int migrate_page_move_mapping(struct address_space *mapping,
> +int migrate_page_move_mapping(struct address_space *mapping,
> struct page *newpage, struct page *page,
> struct buffer_head *head, enum migrate_mode mode)
> {
> diff --git a/mm/swap.c b/mm/swap.c
> index dfd7d71..bbfba0a 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -160,6 +160,7 @@ skip_lock_tail:
>
> void put_page(struct page *page)
> {
> + BUG_ON(page_count(page) <= 0);
> if (unlikely(PageCompound(page)))
> put_compound_page(page);
> else if (put_page_testzero(page))
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Benjamin LaHaise
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())
Date: Tue, 2 Jul 2013 14:00:08 -0400
Message-ID: <20130702180008.GQ16399@kvack.org>
References: <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Tang Chen, Mel Gorman, Minchan Kim, akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com,
laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski
To: Gu Zheng
Return-path:
Content-Disposition: inline
In-Reply-To: <51D12E7B.6080301@cn.fujitsu.com>
Sender: owner-linux-aio@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote:
> Hi Ben,
> Are you still working on this patch?
> As you know, using the current anon inode will lead to more than one instance of
> aio can not work. Have you found a way to fix this issue? Or can we use some
> other ones to replace the anon inode?

This patch hasn't been a high priority for me. I would really appreciate
it if someone could confirm that this patch does indeed fix the hotplug
page migration issue by testing it in a system that hits the bug. Removing
the anon_inode bits isn't too much work, but I'd just like to have some
confirmation that this fix is considered to be "good enough" for the
problem at hand before spending any further time on it. There was talk of
using another approach, but it's not clear if there was any progress.

-ben
--
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.
For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: aart@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gu Zheng
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())
Date: Wed, 03 Jul 2013 09:53:33 +0800
Message-ID: <51D3841D.9040906@cn.fujitsu.com>
References: <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: Tang Chen, Mel Gorman, Minchan Kim, akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski
To: Benjamin LaHaise
Return-path:
In-Reply-To: <20130702180008.GQ16399@kvack.org>
Sender: owner-linux-aio@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On 07/03/2013 02:00 AM, Benjamin LaHaise wrote:
> On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote:
>> Hi Ben,
>> Are you still working on this patch?
>> As you know, using the current anon inode will lead to more than one instance of
>> aio can not work. Have you found a way to fix this issue? Or can we use some
>> other ones to replace the anon inode?
>
> This patch hasn't been a high priority for me.
I would really appreciate
> it if someone could confirm that this patch does indeed fix the hotplug
> page migration issue by testing it in a system that hits the bug. Removing
> the anon_inode bits isn't too much work, but I'd just like to have some
> confirmation that this fix is considered to be "good enough" for the
> problem at hand before spending any further time on it. There was talk of
> using another approach, but it's not clear if there was any progress.

Yeah, we have not seen anyone try to fix this issue using the other approach we
talked about. I'm not sure whether your patch can indeed fix the problem, but
I'll carry out a complete test to confirm it, and I'll be very glad to continue
this job based on your patch if you do not have enough time to work on it. :)

Thanks,
Gu

>
> -ben

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org. For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: aart@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gu Zheng
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())
Date: Thu, 04 Jul 2013 14:51:18 +0800
Message-ID: <51D51B66.3000301@cn.fujitsu.com>
References: <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: Tang Chen, Mel Gorman, Minchan Kim, akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com,
isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski
To: Benjamin LaHaise
Return-path:
In-Reply-To: <20130702180008.GQ16399@kvack.org>
Sender: owner-linux-aio@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On 07/03/2013 02:00 AM, Benjamin LaHaise wrote:
> On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote:
>> Hi Ben,
>> Are you still working on this patch?
>> As you know, using the current anon inode will lead to more than one instance of
>> aio can not work. Have you found a way to fix this issue? Or can we use some
>> other ones to replace the anon inode?
>
> This patch hasn't been a high priority for me. I would really appreciate
> it if someone could confirm that this patch does indeed fix the hotplug
> page migration issue by testing it in a system that hits the bug. Removing
> the anon_inode bits isn't too much work, but I'd just like to have some
> confirmation that this fix is considered to be "good enough" for the
> problem at hand before spending any further time on it. There was talk of
> using another approach, but it's not clear if there was any progress.

Hi Ben,
When I tested your patch on kernel 3.10, the kernel panicked when an aio job
completed or exited, specifically in aio_free_ring(). The following is part of
the dmesg output.

Thanks,
Gu

kernel BUG at mm/swap.c:163!
invalid opcode: 0000 [#1] SMP Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF 3.10.0-aio-migrate+ #107 Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 Workqueue: events kill_ioctx_work task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 RIP: 0010:[] [] put_page+0x48/0x60 RSP: 0018:ffff8807dda99cd8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8807be1f1e00 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea001b196c80 RBP: ffff8807dda99cd8 R08: 0000000000000000 R09: 0000000000000000 R10: ffff8807ffbb5f00 R11: 000000000000005a R12: 0000000000000001 R13: 0000000000000000 R14: ffff8807dda974e0 R15: ffff8807be1f1ec8 FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003b826dc7d0 CR3: 0000000001a0b000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8807dda99d18 ffffffff811b11f6 0000000000000000 0000000200000000 ffff8807be1f1e00 ffff8807be1f1e80 000000000000000c 0000000000000000 ffff8807dda99dc8 ffffffff811b21a2 00000001000438ec 
ffff8807fd692d00 Call Trace: [] aio_free_ring+0x96/0x1c0 [] free_ioctx+0x1f2/0x250 [] ? idle_balance+0xed/0x140 [] put_ioctx+0x1a/0x30 [] kill_ioctx_work+0x2f/0x40 [] process_one_work+0x183/0x490 [] worker_thread+0x120/0x3a0 [] ? manage_workers+0x160/0x160 [] kthread+0xce/0xe0 [] ? kthread_freezable_should_stop+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? kthread_freezable_should_stop+0x70/0x70 Code: 07 00 c0 75 1f f0 ff 4f 1c 0f 94 c0 84 c0 75 0b c9 66 90 c3 0f 1f 80 00 00 00 00 e8 53 fe ff ff c9 66 90 c3 e8 7a fe ff ff c9 c3 <0f> 0b 66 0f 1f 44 00 00 eb f8 48 8b 47 30 eb bc 0f 1f 84 00 00 RIP [] put_page+0x48/0x60 RSP ---[ end trace b5e2c17407c840d8 ]--- Jul 4 15:49:50 BUG: unable to handle kernel paging request at ffffffffffffffd8 IP: [] kthread_data+0x10/0x20 PGD 1a0c067 PUD 1a0e067 PMD 0 Oops: 0000 [#2] SMP Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF D 3.10.0-aio-migrate+ #107 Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 RIP: 0010:[] [] kthread_data+0x10/0x20 RSP: 0018:ffff8807dda999b8 EFLAGS: 00010092 RAX: 0000000000000000 RBX: 0000000000000004 RCX: ffffffff81da3ea0 RDX: 0000000000000000 RSI: 
0000000000000004 RDI: ffff8807dda974e0 RBP: ffff8807dda999b8 R08: ffff8807dda97550 R09: 0000000000000006 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004 R13: ffff8807dda97ab8 R14: 0000000000000001 R15: 0000000000000006 FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000001a0b000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8807dda999d8 ffffffff8105e155 ffff8807dda999d8 ffff8807fd692d00 ffff8807dda99a68 ffffffff8154168b ffff8807dda99fd8 0000000000012d00 ffff8807dda98010 0000000000012d00 0000000000012d00 0000000000012d00 Call Trace: [] wq_worker_sleeping+0x15/0xa0 [] __schedule+0x5ab/0x6f0 [] ? put_io_context_active+0xc2/0xf0 [] schedule+0x29/0x70 [] do_exit+0x2d5/0x480 [] oops_end+0xa9/0xf0 [] die+0x5b/0x90 [] do_trap+0xcb/0x170 [] ? __atomic_notifier_call_chain+0x12/0x20 [] do_invalid_op+0x95/0xb0 [] ? put_page+0x48/0x60 [] ? truncate_inode_pages_range+0x201/0x500 [] invalid_op+0x18/0x20 [] ? put_page+0x48/0x60 [] ? truncate_setsize+0x19/0x20 [] aio_free_ring+0x96/0x1c0 [] free_ioctx+0x1f2/0x250 [] ? idle_balance+0xed/0x140 [] put_ioctx+0x1a/0x30 [] kill_ioctx_work+0x2f/0x40 [] process_one_work+0x183/0x490 [] worker_thread+0x120/0x3a0 [] ? manage_workers+0x160/0x160 [] kthread+0xce/0xe0 [] ? kthread_freezable_should_stop+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? kthread_freezable_should_stop+0x70/0x70 Code: 80 05 00 00 48 8b 40 c8 c9 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 80 05 00 00 <48> 8b 40 d8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 RIP [] kthread_data+0x10/0x20 RSP CR2: ffffffffffffffd8 ---[ end trace b5e2c17407c840d9 ]--- DP kernel: -----Fixing recursive fault but reboot is needed! 
-------[ cut here ]------------ Jul 4 15:49:50 DP kernel: kernel BUG at mm/swap.c:163! Jul 4 15:49:50 DP kernel: invalid opcode: 0000 [#1] SMP Jul 4 15:49:50 DP kernel: Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) Jul 4 15:49:50 DP kernel: CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF 3.10.0-aio-migrate+ #107 Jul 4 15:49:50 DP kernel: Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 Jul 4 15:49:50 DP kernel: Workqueue: events kill_ioctx_work Jul 4 15:49:50 DP kernel: task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 Jul 4 15:49:50 DP kernel: RIP: 0010:[] [] put_page+0x48/0x60 Jul 4 15:49:50 DP kernel: RSP: 0018:ffff8807dda99cd8 EFLAGS: 00010246 Jul 4 15:49:50 DP kernel: RAX: 0000000000000000 RBX: ffff8807be1f1e00 RCX: 0000000000000001 Jul 4 15:49:50 DP kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea001b196c80 Jul 4 15:49:50 DP kernel: RBP: ffff8807dda99cd8 R08: 0000000000000000 R09: 0000000000000000 Jul 4 15:49:50 DP kernel: R10: ffff8807ffbb5f00 R11: 000000000000005a R12: 0000000000000001 Jul 4 15:49:50 DP kernel: R13: 0000000000000000 R14: ffff8807dda974e0 R15: ffff8807be1f1ec8 Jul 4 15:49:50 DP kernel: FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) 
knlGS:0000000000000000 Jul 4 15:49:50 DP kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jul 4 15:49:50 DP kernel: CR2: 0000003b826dc7d0 CR3: 0000000001a0b000 CR4: 00000000000007e0 Jul 4 15:49:50 DP kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 4 15:49:50 DP kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jul 4 15:49:50 DP kernel: Stack: Jul 4 15:49:50 DP kernel: ffff8807dda99d18 ffffffff811b11f6 0000000000000000 0000000200000000 Jul 4 15:49:50 DP kernel: ffff8807be1f1e00 ffff8807be1f1e80 000000000000000c 0000000000000000 Jul 4 15:49:50 DP kernel: ffff8807dda99dc8 ffffffff811b21a2 00000001000438ec ffff8807fd692d00 Jul 4 15:49:50 DP kernel: Call Trace: Jul 4 15:49:50 DP kernel: [] aio_free_ring+0x96/0x1c0 Jul 4 15:49:50 DP kernel: [] free_ioctx+0x1f2/0x250 Jul 4 15:49:50 DP kernel: [] ? idle_balance+0xed/0x140 Jul 4 15:49:50 DP kernel: [] put_ioctx+0x1a/0x30 Jul 4 15:49:50 DP kernel: [] kill_ioctx_work+0x2f/0x40 Jul 4 15:49:50 DP kernel: [] process_one_work+0x183/0x490 Jul 4 15:49:50 DP kernel: [] worker_thread+0x120/0x3a0 Jul 4 15:49:50 DP kernel: [] ? manage_workers+0x160/0x160 Jul 4 15:49:50 DP kernel: [] kthread+0xce/0xe0 Jul 4 15:49:50 DP kernel: [] ? kthread_freezable_should_stop+0x70/0x70 Jul 4 15:49:50 DP kernel: [] ret_from_fork+0x7c/0xb0 Jul 4 15:49:50 DP kernel: [] ? 
kthread_freezable_should_stop+0x70/0x70 Jul 4 15:49:50 DP kernel: Code: 07 00 c0 75 1f f0 ff 4f 1c 0f 94 c0 84 c0 75 0b c9 66 90 c3 0f 1f 80 00 00 00 00 e8 53 fe ff ff c9 66 90 c3 e8 7a fe ff ff c9 c3 <0f> 0b 66 0f 1f 44 00 00 eb f8 48 8b 47 30 eb bc 0f 1f 84 00 00 Jul 4 15:49:50 DP kernel: RIP [] put_page+0x48/0x60 Jul 4 15:49:50 DP kernel: RSP Jul 4 15:49:50 DP kernel: ---[ end trace b5e2c17407c840d8 ]--- INFO: rcu_sched detected stalls on CPUs/tasks: { 4} (detected by 9, t=21056 jiffies, g=4158, c=4157, q=1040) sending NMI to all CPUs: NMI backtrace for cpu 4 CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF D 3.10.0-aio-migrate+ #107 Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 RIP: 0010:[] [] _raw_spin_lock_irq+0x22/0x30 RSP: 0018:ffff8807dda99618 EFLAGS: 00000002 RAX: 000000000000497c RBX: ffff8807fd692d00 RCX: ffff8807dda98010 RDX: 000000000000497e RSI: ffffffff815419a9 RDI: ffff8807fd692d00 RBP: ffff8807dda99618 R08: 0000000000000004 R09: 0000000000000100 R10: 00000000000009fe R11: 00000000000009fe R12: 0000000000000004 R13: 0000000000000009 R14: 0000000000000009 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000001a0b000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8807dda996a8 ffffffff815411b6 ffff8807dda99fd8 0000000000012d00 ffff8807dda98010 0000000000012d00 0000000000012d00 0000000000012d00 ffff8807dda99fd8 0000000000012d00 ffff8807dda974e0 ffff8807dda996c8 Call Trace: [] __schedule+0xd6/0x6f0 [] schedule+0x29/0x70 [] do_exit+0x42a/0x480 [] oops_end+0xa9/0xf0 [] no_context+0x11e/0x1f0 [] __bad_area_nosemaphore+0x11d/0x220 [] bad_area_nosemaphore+0x13/0x20 [] 
__do_page_fault+0xc5/0x490 [] ? call_rcu_sched+0x17/0x20 [] ? strlcpy+0x4a/0x60 [] do_page_fault+0xe/0x10 [] page_fault+0x22/0x30 [] ? kthread_data+0x10/0x20 [] wq_worker_sleeping+0x15/0xa0 [] __schedule+0x5ab/0x6f0 [] ? put_io_context_active+0xc2/0xf0 [] schedule+0x29/0x70 [] do_exit+0x2d5/0x480 [] oops_end+0xa9/0xf0 [] die+0x5b/0x90 [] do_trap+0xcb/0x170 [] ? __atomic_notifier_call_chain+0x12/0x20 [] do_invalid_op+0x95/0xb0 [] ? put_page+0x48/0x60 [] ? truncate_inode_pages_range+0x201/0x500 [] invalid_op+0x18/0x20 [] ? put_page+0x48/0x60 [] ? truncate_setsize+0x19/0x20 [] aio_free_ring+0x96/0x1c0 [] free_ioctx+0x1f2/0x250 [] ? idle_balance+0xed/0x140 [] put_ioctx+0x1a/0x30 [] kill_ioctx_work+0x2f/0x40 [] process_one_work+0x183/0x490 [] worker_thread+0x120/0x3a0 [] ? manage_workers+0x160/0x160 [] kthread+0xce/0xe0 [] ? kthread_freezable_should_stop+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? kthread_freezable_should_stop+0x70/0x70 Code: 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 fa b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0d 0f 1f 00 f3 90 <0f> b7 07 66 39 c2 75 f6 c9 c3 0f 1f 40 00 55 48 89 e5 66 66 66 NMI backtrace for cpu 1 > > -ben -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: aart@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Benjamin LaHaise
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())
Date: Thu, 4 Jul 2013 07:41:53 -0400
Message-ID: <20130704114153.GD11006@kvack.org>
References: <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org> <51D51B66.3000301@cn.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Tang Chen, Mel Gorman, Minchan Kim, akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski
To: Gu Zheng
Return-path:
Content-Disposition: inline
In-Reply-To: <51D51B66.3000301@cn.fujitsu.com>
Sender: owner-linux-aio@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Thu, Jul 04, 2013 at 02:51:18PM +0800, Gu Zheng wrote:
> Hi Ben,
> When I tested your patch on kernel 3.10, the kernel panicked when an aio job
> completed or exited, specifically in aio_free_ring(). The following is part of
> the dmesg output.

What is your test case?

-ben

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.
For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: aart@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gu Zheng
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())
Date: Fri, 05 Jul 2013 11:21:00 +0800
Message-ID: <51D63B9C.4060204@cn.fujitsu.com>
References: <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org> <51D51B66.3000301@cn.fujitsu.com> <20130704114153.GD11006@kvack.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: Tang Chen, Mel Gorman, Minchan Kim, akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski
To: Benjamin LaHaise
Return-path:
In-Reply-To: <20130704114153.GD11006@kvack.org>
Sender: owner-linux-aio@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On 07/04/2013 07:41 PM, Benjamin LaHaise wrote:
> On Thu, Jul 04, 2013 at 02:51:18PM +0800, Gu Zheng wrote:
>> Hi Ben,
>> When I tested your patch on kernel 3.10, the kernel panicked when an aio job
>> completed or exited, specifically in aio_free_ring(). The following is part of
>> the dmesg output.
>
> What is your test case?

Just the one you mentioned in the previous mail:
http://www.kvack.org/~bcrl/aio/aio-numa-test.c

Thanks,
Gu

>
> -ben
>

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org. For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: aart@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
From: Lin Feng
Subject: [PATCH V2 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
Date: Tue, 5 Feb 2013 17:21:51 +0800
Message-Id: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk
Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng

Currently get_user_pages() always tries to allocate pages from movable zone,
as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users
of get_user_pages() is easy to pin user pages for a long time(for now we found
that pages pinned as aio ring pages is such case), which is fatal for memory
hotplug/remove framework.

So the 1st patch introduces a new library function called
get_user_pages_non_movable() to pin pages only from zone non-movable in memory.
It's a wrapper of get_user_pages() but it makes sure that all pages come from
non-movable zone via additional page migration.

The 2nd patch gets around the aio ring pages can't be migrated bug caused by
get_user_pages() via using the new function. It only works when configed with
CONFIG_MEMORY_HOTREMOVE, otherwise it falls back to use the old version of
get_user_pages().

---
ChangeLog v1->v2:
Patch1:
- Fix the negative return value bug pointed out by Andrew and other
  suggestions pointed out by Andrew and Jeff.
Patch2:
- Kill the CONFIG_MEMORY_HOTREMOVE dependence suggested by Jeff.

---
Lin Feng (2):
  mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove

 fs/aio.c               |  4 +-
 include/linux/mm.h     |  3 ++
 include/linux/mmzone.h |  4 ++
 mm/memory.c            | 83 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_isolation.c    |  5 +++
 5 files changed, 97 insertions(+), 2 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Tue, 5 Feb 2013 12:01:37 +0000
From: Mel Gorman
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Message-ID: <20130205120137.GG21389@suse.de>
References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Lin Feng
Cc: akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org

On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote:
> get_user_pages() always tries to allocate pages
from movable zone, which is not
> reliable to memory hotremove framework in some case.
>
> This patch introduces a new library function called get_user_pages_non_movable()
> to pin pages only from zone non-movable in memory.
> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> non-movable zone via additional page migration.
>
> Cc: Andrew Morton
> Cc: Mel Gorman
> Cc: KAMEZAWA Hiroyuki
> Cc: Yasuaki Ishimatsu
> Cc: Jeff Moyer
> Cc: Minchan Kim
> Cc: Zach Brown
> Reviewed-by: Tang Chen
> Reviewed-by: Gu Zheng
> Signed-off-by: Lin Feng

I already had started the review of V1 before this was sent
unfortunately. However, I think the feedback I gave for V1 is still
valid so I'll wait for comments on that review before digging further.

--
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Tue, 5 Feb 2013 19:52:17 -0500
From: Benjamin LaHaise
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Message-ID: <20130206005217.GJ20842@kvack.org>
References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130206004234.GD11197@blaptop>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: Mel Gorman, Lin Feng, akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org,
linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org

On Wed, Feb 06, 2013 at 09:42:34AM +0900, Minchan Kim wrote:
> THP degradation by increasing MIGRATE_UNMOVABLE?
> Lin said most of GUP pages release the page in short so is it really problem?
> Even in embedded, we don't use THP yet but CMA and GUP call would be not too often
> but failing of CMA would be critical.
>
> I'd like to hear opinions.

If aio was given a callback to migrate the pages on, it could just migrate
the pages as needed. There's nothing fundamental preventing that approach.

-ben
--
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 8 Feb 2013 11:32:37 +0900
From: Minchan Kim
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Message-ID: <20130208023237.GK11197@blaptop>
References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130206095617.GN21389@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: Lin Feng, akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Mel,

On Wed, Feb 06, 2013 at 09:56:17AM +0000,
Mel Gorman wrote:
> On Wed, Feb 06, 2013 at 09:42:34AM +0900, Minchan Kim wrote:
> > On Tue, Feb 05, 2013 at 12:01:37PM +0000, Mel Gorman wrote:
> > > On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote:
> > > > get_user_pages() always tries to allocate pages from movable zone, which is not
> > > > reliable to memory hotremove framework in some case.
> > > >
> > > > This patch introduces a new library function called get_user_pages_non_movable()
> > > > to pin pages only from zone non-movable in memory.
> > > > It's a wrapper of get_user_pages() but it makes sure that all pages come from
> > > > non-movable zone via additional page migration.
> > > >
> > > > Cc: Andrew Morton
> > > > Cc: Mel Gorman
> > > > Cc: KAMEZAWA Hiroyuki
> > > > Cc: Yasuaki Ishimatsu
> > > > Cc: Jeff Moyer
> > > > Cc: Minchan Kim
> > > > Cc: Zach Brown
> > > > Reviewed-by: Tang Chen
> > > > Reviewed-by: Gu Zheng
> > > > Signed-off-by: Lin Feng
> > >
> > > I already had started the review of V1 before this was sent
> > > unfortunately. However, I think the feedback I gave for V1 is still
> > > valid so I'll wait for comments on that review before digging further.
> >
> > Mel, Andrew
> >
> > Sorry for making noise if you already confirmed the direction but I have a concern
> > about that.
>
> I haven't confirmed any sort of direction, nor do I determine the
> direction for memory hot-remove which I'm only paying vague attention to.
> I stated a while ago that I think the use of ZONE_MOVABLE is a bad idea
> for "guaranteeing" memory hot-remove and is already going the "wrong"
> direction. That's just my opinion.
>
> This patch is about mitigating (but not solving) the problem of long-lived
> pins. In the general case, about all I could think of for that is that the

Agreed.

> kernel would have to warn the administrator what applications had pinned
> the memory and wait for the user to shut them down. To guarantee anything,
> it would be necessary for subsystems to implement a callback for migration
> to unpin pages, barrier operations until migration completes and pin the
> new pfns.

That could work for a SUBSYSTEM, but it's very hard for every DRIVER developer,
and I doubt we can give them a common template that most driver developers can reuse.

>
> > Because IMHO, we can't expect most of user for MEMORY_HOTPLUG will release
> > pinned pages immediately.
>
> Indeed not, but it's not really what this patch is about. This patch is
> about moving the pages before they get permanently pinned. It mitigates
> the problem but does not solve it because there is no guarantee that the
> driver pinning a page will flag it properly.

True. And I doubt what the memory-hotplug guys really want is best effort, not a guarantee.
Anyway, CMA wants a guarantee, even low latency, and I hope this patch solves the
problem for both memory-hotplug and CMA.

>
> > In addition, MEMORY_HOTPLUG could be used for embedded system
> > for reducing power by PASR and some drivers in embedded could use GUP anytime and anywhere.
> > They can't know in advance they will use pinned pages long time or release in short time
> > because it depends on some event like user's response which is very not predetermined.
>
> True. This patch does not solve that problem.
>
> > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of
> > failing migration by page count and then, investigate they are really using GUP and
> > it's REALLY a culprit. If so, yell to them "Please use GUP_NM instead"?
>
> Within the context of this patch, that is their main option. Finding
> who is holding the pin is a problem. For userspace-pinned buffers it's
> straight-forward as rmap will identify what processes are holding the
> pin (page->list vmas->mm, lookup all tasks until p->mm == mm) and report
> that. For driver-related pins, it's not as straight-forward. I guess there

True.

> could be callback to give meaningful information on it but no guarantee
> that drivers pinning pages will implement it. In that case all you could do

Nod.

> was dump page->mapping and punt it at a kernel developer to figure out the
> responsible driver. This might be manageable for memory hot-remove where
> there is an administrator but may not work at all for embedded users.

Yeah. There are even proprietary modules in embedded, and we can't see their source code.

>
> There is the possibility that callbacks could be introduced for
> migrate_unpin() and migrate_pin() that takes a list of PFN pairs
> (old,new). The unpin callback should release the old PFNs and barrier
> against any operations until the migrate_pfn() callback is called with
> the updated pfns to be repinned. Again it would fully depend on subsystems
> implementing it properly.
>
> The callback interface would be more robust but puts a lot more work on
> the driver side where your mileage will vary.

True.

>
> > Yes. it could be done but it would be rather troublesome job. Even it couldn't be triggered
> > during QE phase so that trouble doesn't end until all guys uses GUP_NM.
> > Let's consider another case. Some driver pin the page in very short time
> > so he decide to use GUP instead of GUP_NM but someday, some user start to use the driver
> > very often so although pinning time is very short, it could be forever pinning effect
> > if the user calls it very often. In the end, we should change it with GUP_NM, again.
> > IMHO, In future, we ends up changing most of GUP user with GUP_NM if CMA and MEMORY_HOTPLUG
> > is available all over the world.
> >
>
> Same thing, callbacks to unpin and barrier would handle such a case by
> effectively freezing the driver or subsystem responsible for the page.
>
> > So, what's wrong if we replace get_user_pages with get_user_pages_non_movable
> > in MEMORY_HOTPLUG/CMA without exposing get_user_pages_non_movable?
> >
> > I mean this
> >
> > #ifdef CONFIG_MIGRATE_ISOLATE
> > int get_user_pages()
> > {
> > 	return __get_user_pages_non_movable();
> > }
> > #else
> > int get_user_pages()
> > {
> > 	return old_get_user_pages();
> > }
> > #endif
> >
>
> That will migrate everything out of ZONE_MOVABLE every time it's pinned.
> One consequence is that direct IO can never use ZONE_MOVABLE on these
> systems. It'll create a variation of the lowmem exhaustion problem.

For example, say there is a 4G highmem zone and half of it is the movable zone.
In this case, we can use the extra 2G of highmem zone space instead of lowmem.
But I agree it could end up pinning many lowmem pages, so the problem would happen.
IMHO, shouldn't that be the trade-off for using MEMORY-HOTPLUG/CMA?

>
> > IMHO, get_user_pages isn't performance sensitive function. If user was sensitive
> > about it, he should have tried get_user_pages_fast.
>
> That opens a different can of worms. get_user_pages is part of the
> gup_fast slowpath.
>
> > THP degradation by increasing MIGRATE_UNMOVABLE?
>
> The patch should not be converting MIGRATE_MOVABLE requests to
> MIGRATE_UNMOVABLE. I covered this in the review of v1.

I guess the memory-hotplug guys want to use GUP_NM for long-time pin users.
So doesn't it make sense to migrate the page into MIGRATE_UNMOVABLE?
But I'm not sure about GUP_NM's semantics.

>
> > Lin said most of GUP pages release the page in short so is it really problem?
> > Even in embedded, we don't use THP yet but CMA and GUP call would be not too often
> > but failing of CMA would be critical.
> >
>
> To guarantee CMA can migrate pages pinned by drivers I think you need
> migrate-related callbacks to unpin, barrier the driver until migration
> completes and repin.

I agree it's an ideal solution when we consider the future, but as you already
mentioned, it's not easy for all drivers. In fact, I don't want to insist on my
opinion for CMA because I guess the CMA design was not good from the beginning.
I just posted my concern and want to discuss how to solve the problem, but if there
is no plain solution now, let me pass the decision to the maintainer.

Thanks for sharing your opinion, Mel!

>
> I do not know, or at least have not heard, of anyone working on such a
> scheme.
>
> --
> Mel Gorman
> SUSE Labs

-- 
Kind regards,
Minchan Kim

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx174.postini.com [74.125.245.174]) by kanga.kvack.org (Postfix) with SMTP id 6F1626B0008 for ; Wed, 20 Feb 2013 06:38:11 -0500 (EST)
Received: from /spool/local by e23smtp07.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
	Violators will be prosecuted for from ; Wed, 20 Feb 2013 21:30:46 +1000
Date: Wed, 20 Feb 2013 19:37:57 +0800
From: Wanpeng Li
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Message-ID: <20130220113757.GA10124@hacker.(null)>
Reply-To: Wanpeng Li
References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Lin Feng
Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org

On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote:
>get_user_pages() always tries to allocate pages from movable zone, which is not
> reliable to memory hotremove framework in some case.
>
>This patch introduces a new library function called get_user_pages_non_movable()
> to pin pages only from zone non-movable in memory.
>It's a wrapper of get_user_pages() but it makes sure that all pages come from
>non-movable zone via additional page migration.
>
>Cc: Andrew Morton
>Cc: Mel Gorman
>Cc: KAMEZAWA Hiroyuki
>Cc: Yasuaki Ishimatsu
>Cc: Jeff Moyer
>Cc: Minchan Kim
>Cc: Zach Brown
>Reviewed-by: Tang Chen
>Reviewed-by: Gu Zheng
>Signed-off-by: Lin Feng
>---
> include/linux/mm.h     |    3 ++
> include/linux/mmzone.h |    4 ++
> mm/memory.c            |   83 ++++++++++++++++++++++++++++++++++++++++++++++++
> mm/page_isolation.c    |    5 +++
> 4 files changed, 95 insertions(+), 0 deletions(-)
>
>diff --git a/include/linux/mm.h b/include/linux/mm.h
>index 12f5a09..3ff9eba 100644
>--- a/include/linux/mm.h
>+++ b/include/linux/mm.h
>@@ -1049,6 +1049,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> 			struct page **pages, struct vm_area_struct **vmas);
> int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> 			struct page **pages);
>+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
>+		unsigned long start, int nr_pages, int write, int force,
>+		struct page **pages, struct vm_area_struct **vmas);
> struct kvec;
> int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
> 			struct page **pages);
>diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>index e25ab6f..1506351 100644
>--- a/include/linux/mmzone.h
>+++ b/include/linux/mmzone.h
>@@ -841,6 +841,10 @@ static inline int is_normal_idx(enum zone_type idx)
> 	return (idx == ZONE_NORMAL);
> }
>
>+static inline int zone_is_movable(struct zone *zone)
>+{
>+	return zone_idx(zone) == ZONE_MOVABLE;
>+}
> /**
>  * is_highmem - helper function to quickly check if a struct zone is a
>  * highmem zone or not. This is an attempt to keep references
>diff --git a/mm/memory.c b/mm/memory.c
>index bb1369f..ede53cc 100644
>--- a/mm/memory.c
>+++ b/mm/memory.c
>@@ -58,6 +58,8 @@
> #include
> #include
> #include
>+#include
>+#include
> #include
>
> #include
>@@ -1995,6 +1997,87 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> }
> EXPORT_SYMBOL(get_user_pages);
>
>+#ifdef CONFIG_MEMORY_HOTREMOVE
>+/**
>+ * It's a wrapper of get_user_pages() but it makes sure that all pages come from
>+ * non-movable zone via additional page migration. It's designed for memory
>+ * hotremove framework.
>+ *
>+ * Currently get_user_pages() always tries to allocate pages from movable zone,
>+ * in some case users of get_user_pages() is easy to pin user pages for a long
>+ * time(for now we found that pages pinned as aio ring pages is such case),
>+ * which is fatal for memory hotremove framework.
>+ *
>+ * This function first calls get_user_pages() to get the candidate pages, and
>+ * then check to ensure all pages are from non movable zone. Otherwise migrate

How about "Otherwise migrate candidate pages which have already been
isolated to non movable zone."?

>+ * them to non movable zone, then retry. It will at most retry once.
>+ */
>+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
>+		unsigned long start, int nr_pages, int write, int force,
>+		struct page **pages, struct vm_area_struct **vmas)
>+{
>+	int ret, i, isolate_err, migrate_pre_flag;
>+	LIST_HEAD(pagelist);
>+
>+retry:
>+	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
>+				vmas);
>+	if (ret <= 0)
>+		return ret;
>+
>+	isolate_err = 0;
>+	migrate_pre_flag = 0;
>+
>+	for (i = 0; i < ret; i++) {
>+		if (zone_is_movable(page_zone(pages[i]))) {
>+			if (!migrate_pre_flag) {
>+				if (migrate_prep())
>+					goto release_page;
>+				migrate_pre_flag = 1;
>+			}
>+
>+			if (!isolate_lru_page(pages[i])) {
>+				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>+						 page_is_file_cache(pages[i]));
>+				list_add_tail(&pages[i]->lru, &pagelist);
>+			} else {
>+				isolate_err = 1;
>+				goto release_page;
>+			}
>+		}
>+	}
>+
>+	/* All pages are non movable, we are done :) */
>+	if (i == ret && list_empty(&pagelist))
>+		return ret;
>+
>+release_page:
>+	/* Undo the effects of former get_user_pages(), we won't pin anything */
>+	release_pages(pages, ret, 1);
>+
>+	if (migrate_pre_flag && !isolate_err) {
>+		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>+					false, MIGRATE_SYNC, MR_SYSCALL);
>+		/* Steal pages from non-movable zone successfully? */
>+		if (!ret)
>+			goto retry;
>+	}
>+
>+	putback_lru_pages(&pagelist);
>+	/* Migration failed, we pin 0 page, tell caller the truth */
>+	return 0;
>+}
>+#else
>+inline int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
>+		unsigned long start, int nr_pages, int write, int force,
>+		struct page **pages, struct vm_area_struct **vmas)
>+{
>+	return get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
>+				vmas);
>+}
>+#endif
>+EXPORT_SYMBOL(get_user_pages_non_movable);
>+
> /**
>  * get_dump_page() - pin user page in memory while writing it to core dump
>  * @addr: user address
>diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>index 383bdbb..1b7bd17 100644
>--- a/mm/page_isolation.c
>+++ b/mm/page_isolation.c
>@@ -247,6 +247,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
> 	return ret ? 0 : -EBUSY;
> }
>
>+/**
>+ * @private: 0 means page can be alloced from movable zone, otherwise forbidden
>+ */
> struct page *alloc_migrate_target(struct page *page, unsigned long private,
> 				  int **resultp)
> {
>@@ -254,6 +257,8 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
>
> 	if (PageHighMem(page))
> 		gfp_mask |= __GFP_HIGHMEM;
>+	if (unlikely(private != 0))
>+		gfp_mask &= ~__GFP_MOVABLE;
>
> 	return alloc_page(gfp_mask);
> }
>--
>1.7.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Message-ID: <5124C3E6.1060108@cn.fujitsu.com>
Date: Wed, 20 Feb 2013 20:39:02 +0800
From: Lin Feng
MIME-Version: 1.0
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130220113757.GA10124@hacker.(null)>
In-Reply-To: <20130220113757.GA10124@hacker.(null)>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
List-ID:
To: Wanpeng Li
Cc: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Wanpeng,

On 02/20/2013 07:37 PM, Wanpeng Li wrote:
>> + * This function first calls get_user_pages() to get the candidate pages, and
>> + * then check to ensure all pages are from non movable zone. Otherwise migrate
> How about "Otherwise migrate candidate pages which have already been
> isolated to non movable zone."?
>

That is just what the code does; I feel it's too detailed to be proper :(
Do we have to comment it in that much detail?

thanks,
linfeng
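[Editorial sketch, not part of the thread.] The "pin, check zones, release,
migrate, retry" flow described in the patch comment above can be modelled in
plain userspace C. This is only a minimal illustration under stated
assumptions: struct sim_page, the SIM_ZONE_* values and every sim_*() helper
are hypothetical stand-ins invented here, not kernel APIs, and the explicit
two-attempt bound is this sketch's own choice to mirror the comment's "at
most retry once", not a claim about what the posted code does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel zones; not the kernel's enum zone_type. */
enum sim_zone { SIM_ZONE_NORMAL, SIM_ZONE_MOVABLE };

struct sim_page {
	enum sim_zone zone;	/* zone currently backing this page */
	int pin_count;		/* references held by pinners */
	bool migratable;	/* false models an isolation/migration failure */
};

/* Model of get_user_pages(): pin every page, return how many were pinned. */
static int sim_pin(struct sim_page *pages, int nr)
{
	for (int i = 0; i < nr; i++)
		pages[i].pin_count++;
	return nr;
}

/* Model of release_pages(): drop the pins taken by sim_pin(). */
static void sim_release(struct sim_page *pages, int nr)
{
	for (int i = 0; i < nr; i++) {
		assert(pages[i].pin_count > 0);
		pages[i].pin_count--;
	}
}

/*
 * Model of migration: move every unpinned, migratable page out of the
 * movable zone. Returns 0 on success, -1 if any page had to stay behind.
 */
static int sim_migrate_to_normal(struct sim_page *pages, int nr)
{
	int err = 0;

	for (int i = 0; i < nr; i++) {
		if (pages[i].zone != SIM_ZONE_MOVABLE)
			continue;
		if (!pages[i].migratable || pages[i].pin_count > 0)
			err = -1;
		else
			pages[i].zone = SIM_ZONE_NORMAL;
	}
	return err;
}

/*
 * Pin pages only if none of them ends up in the movable zone. On the first
 * attempt, movable pages are unpinned and migrated, then pinning is retried
 * once. Returns the number of pages pinned, or 0 if nothing stays pinned.
 */
static int sim_pin_non_movable(struct sim_page *pages, int nr)
{
	for (int attempt = 0; attempt < 2; attempt++) {
		int pinned = sim_pin(pages, nr);
		bool any_movable = false;

		for (int i = 0; i < pinned; i++)
			if (pages[i].zone == SIM_ZONE_MOVABLE)
				any_movable = true;

		if (!any_movable)
			return pinned;	/* all pages are non-movable */

		/* Undo the pins before trying to migrate anything. */
		sim_release(pages, pinned);
		if (attempt == 1 || sim_migrate_to_normal(pages, nr) != 0)
			break;
	}
	return 0;	/* migration failed: report that 0 pages are pinned */
}
```

A caller would treat a return value of 0 the same way the patch comment
describes ("we pin 0 page, tell caller the truth"): no pins are held and the
request has to be failed or retried at a higher level.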
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <5190AE4F.4000103@cn.fujitsu.com> Date: Mon, 13 May 2013 17:11:43 +0800 From: Tang Chen MIME-Version: 1.0 Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> In-Reply-To: <20130206095617.GN21389@suse.de> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15; format=flowed Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Hi Mel, On 02/06/2013 05:56 PM, Mel Gorman wrote: > > There is the possibility that callbacks could be introduced for > migrate_unpin() and migrate_pin() that takes a list of PFN pairs > (old,new). The unpin callback should release the old PFNs and barrier > against any operations until the migrate_pfn() callback is called with > the updated pfns to be repinned. Again it would fully depend on subsystems > implementing it properly. > > The callback interface would be more robust but puts a lot more work on > the driver side where your milage will vary. > I'm very interested in the "callback" way you said. 
For the memory hot-remove case, the aio pages are pinned in memory, so the pages cannot be offlined and, furthermore, cannot be removed.

IIUC, you mean implementing migrate_unpin() and migrate_pin() callbacks in the aio subsystem, and calling them when the hot-remove code tries to offline pages, right?

If so, I'm wondering where we should put these callback pointers. In struct page?

It has been a long time since this topic was discussed. But to solve this problem cleanly for the hotplug guys and the CMA guys, please give some more comments.

Thanks. :)

>
> To guarantee CMA can migrate pages pinned by drivers I think you need
> migrate-related callbacks to unpin, barrier the driver until migration
> completes and repin.
>
> I do not know, or at least have not heard, of anyone working on such a
> scheme.
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Mon, 13 May 2013 10:19:02 +0100
From: Mel Gorman
Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Message-ID: <20130513091902.GP11497@suse.de>
References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <5190AE4F.4000103@cn.fujitsu.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tang Chen
Cc: Minchan Kim, Lin Feng, akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com,
laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski

On Mon, May 13, 2013 at 05:11:43PM +0800, Tang Chen wrote:
> Hi Mel,
>
> On 02/06/2013 05:56 PM, Mel Gorman wrote:
> >
> >There is the possibility that callbacks could be introduced for
> >migrate_unpin() and migrate_pin() that takes a list of PFN pairs
> >(old,new). The unpin callback should release the old PFNs and barrier
> >against any operations until the migrate_pfn() callback is called with
> >the updated pfns to be repinned. Again it would fully depend on subsystems
> >implementing it properly.
> >
> >The callback interface would be more robust but puts a lot more work on
> >the driver side where your mileage will vary.
> >
>
> I'm very interested in the "callback" way you said.
>
> For memory hot-remove case, the aio pages are pinned in memory and making
> the pages cannot be offlined, furthermore, the pages cannot be removed.
>
> IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio
> subsystem, and call them when hot-remove code tries to offline
> pages, right ?
>
> If so, I'm wondering where should we put this callback pointers ?
> In struct page ?
>

No, I would expect the callbacks to be part of the address space operations, which can be found via page->mapping.

--
Mel Gorman
SUSE Labs
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 13 May 2013 10:37:57 -0400 From: Benjamin LaHaise Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130513143757.GP31899@kvack.org> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130513091902.GP11497@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Tang Chen , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski On Mon, May 13, 2013 at 10:19:02AM +0100, Mel Gorman wrote: > On Mon, May 13, 2013 at 05:11:43PM +0800, Tang Chen wrote: ... > > If so, I'm wondering where should we put this callback pointers ? > > In struct page ? > > > > No, I would expect the callbacks to be part the address space operations > which can be found via page->mapping. If someone adds those callbacks and provides a means for testing them, it would be pretty trivial to change the aio code to migrate its pinned pages on demand. -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Jeff Moyer Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> Date: Mon, 13 May 2013 10:54:03 -0400 In-Reply-To: <20130513143757.GP31899@kvack.org> (Benjamin LaHaise's message of "Mon, 13 May 2013 10:37:57 -0400") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: Benjamin LaHaise Cc: Mel Gorman , Tang Chen , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Benjamin LaHaise writes: > On Mon, May 13, 2013 at 10:19:02AM +0100, Mel Gorman wrote: >> On Mon, May 13, 2013 at 05:11:43PM +0800, Tang Chen wrote: > ... >> > If so, I'm wondering where should we put this callback pointers ? >> > In struct page ? >> > >> >> No, I would expect the callbacks to be part the address space operations >> which can be found via page->mapping. > > If someone adds those callbacks and provides a means for testing them, > it would be pretty trivial to change the aio code to migrate its pinned > pages on demand. How do you propose to move the ring pages? Cheers, Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 13 May 2013 11:01:47 -0400 From: Benjamin LaHaise Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130513150147.GQ31899@kvack.org> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Jeff Moyer Cc: Mel Gorman , Tang Chen , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: > How do you propose to move the ring pages? It's the same problem as doing a TLB shootdown: flush the old pages from userspace's mapping, copy any existing data to the new pages, then repopulate the page tables. It will likely require the addition of address_space_operations for the mapping, but that's not too hard to do. -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <5191B5B3.7080406@cn.fujitsu.com> Date: Tue, 14 May 2013 11:55:31 +0800 From: Tang Chen MIME-Version: 1.0 Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> In-Reply-To: <20130513091902.GP11497@suse.de> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15; format=flowed Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Hi Mel, On 05/13/2013 05:19 PM, Mel Gorman wrote: >> For memory hot-remove case, the aio pages are pined in memory and making >> the pages cannot be offlined, furthermore, the pages cannot be removed. >> >> IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio >> subsystem, and call them when hot-remove code tries to offline >> pages, right ? >> >> If so, I'm wondering where should we put this callback pointers ? >> In struct page ? >> > > No, I would expect the callbacks to be part the address space operations > which can be found via page->mapping. > Two more problems I don't quite understand. 1. For an anonymous page, it has no address_space, and no address space operation. 
But the aio ring problem happens precisely when dealing with anonymous pages. Please refer to:
(https://lkml.org/lkml/2012/11/29/69)

If we put the callbacks in page->mapping->a_ops, anonymous pages won't be able to use them. And we cannot provide a default callback, because the situation we are dealing with is a special case.

So where should the callbacks for anonymous pages go?

2. How do we find out why page->count != 1 in migrate_page_move_mapping()?

In the problem we are dealing with, get_user_pages() is called to pin the pages in memory, and the pages are migratable. So we want to decrease page->count.

But get_user_pages() is not the only thing that can increase page->count. How can I know when to decrease page->count and when not to?

The only way I can think of is to assign the callback pointer in get_user_pages(), because it is get_user_pages() that pins the pages.

Thanks. :)

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 14 May 2013 09:58:50 -0400 From: Benjamin LaHaise Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130514135850.GG13845@kvack.org> References: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> <20130513150147.GQ31899@kvack.org> <5191926A.2090608@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5191926A.2090608@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tang Chen Cc: Jeff Moyer , Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote: > Hi Mel, Benjamin, Jeff, > > On 05/13/2013 11:01 PM, Benjamin LaHaise wrote: > >On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: > >>How do you propose to move the ring pages? > > > >It's the same problem as doing a TLB shootdown: flush the old pages from > >userspace's mapping, copy any existing data to the new pages, then > >repopulate the page tables. It will likely require the addition of > >address_space_operations for the mapping, but that's not too hard to do. 
> > > > I think we add migrate_unpin() callback to decrease page->count if > necessary, > and migrate the page to a new page, and add migrate_pin() callback to pin > the new page again. You can't just decrease the page count for this to work. The pages are pinned because aio_complete() can occur at any time and needs to have a place to write the completion events. When changing pages, aio has to take the appropriate lock when changing one page for another. > The migrate procedure will work just as before. We use callbacks to > decrease > the page->count before migration starts, and increase it when the migration > is done. > > And migrate_pin() and migrate_unpin() callbacks will be added to > struct address_space_operations. I think the existing migratepage operation in address_space_operations can be used. Does it get called when hot unplug occurs? That is: is testing with the migrate_pages syscall similar enough to the memory removal case? -ben > Is that right ? > > If so, I'll be working on it. > > Thanks. :) -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <5193377E.30102@cn.fujitsu.com> Date: Wed, 15 May 2013 15:21:34 +0800 From: Tang Chen MIME-Version: 1.0 Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> <20130513150147.GQ31899@kvack.org> <5191926A.2090608@cn.fujitsu.com> <20130514135850.GG13845@kvack.org> <5192EE40.7060407@cn.fujitsu.com> In-Reply-To: <5192EE40.7060407@cn.fujitsu.com> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1; format=flowed Sender: owner-linux-mm@kvack.org List-ID: To: Benjamin LaHaise , Mel Gorman Cc: Jeff Moyer , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Hi Benjamin, Mel, On 05/15/2013 10:09 AM, Tang Chen wrote: > Hi Benjamin, Mel, > > Please see below. > > On 05/14/2013 09:58 PM, Benjamin LaHaise wrote: >> On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote: >>> Hi Mel, Benjamin, Jeff, >>> >>> On 05/13/2013 11:01 PM, Benjamin LaHaise wrote: >>>> On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: >>>>> How do you propose to move the ring pages? >>>> >>>> It's the same problem as doing a TLB shootdown: flush the old pages >>>> from >>>> userspace's mapping, copy any existing data to the new pages, then >>>> repopulate the page tables. 
It will likely require the addition of >>>> address_space_operations for the mapping, but that's not too hard to >>>> do. >>>> >>> >>> I think we add migrate_unpin() callback to decrease page->count if >>> necessary, >>> and migrate the page to a new page, and add migrate_pin() callback to >>> pin >>> the new page again. >> >> You can't just decrease the page count for this to work. The pages are >> pinned because aio_complete() can occur at any time and needs to have a >> place to write the completion events. When changing pages, aio has to >> take the appropriate lock when changing one page for another. > > In aio_complete(), > > aio_complete() { > ...... > spin_lock_irqsave(&ctx->completion_lock, flags); > //write the completion event. > spin_unlock_irqrestore(&ctx->completion_lock, flags); > ...... > } > > So for this problem, I think we can hold kioctx->completion_lock in the aio > callbacks to prevent aio subsystem accessing pages who are being migrated. > Another problem here is: We intend to call these callbacks in the page migrate path, and we need to know which lock to hold. But there is no way for migrate path to know this info. The migrate path is common for all kinds of pages, so we cannot pass any specific parameter to the callbacks in migrate path. When we get a page, we cannot get any kioctx info from the page. So how can the callback know which lock to require without any parameter ? Or do we have any other way to do so ? Would you please give some more advice about this ? BTW, we also need to update kioctx->ring_pages. Thanks. :) >> >>> The migrate procedure will work just as before. We use callbacks to >>> decrease >>> the page->count before migration starts, and increase it when the >>> migration >>> is done. >>> >>> And migrate_pin() and migrate_unpin() callbacks will be added to >>> struct address_space_operations. >> >> I think the existing migratepage operation in address_space_operations >> can >> be used. 
Does it get called when hot unplug occurs? That is: is testing >> with the migrate_pages syscall similar enough to the memory removal case? >> > > But as I said, for anonymous pages such as aio ring buffer, they don't have > address_space_operations. So where should we put the callbacks' pointers ? > > Add something like address_space_operations to struct anon_vma ? > > Thanks. :) > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Wed, 15 May 2013 14:24:53 +0100 From: Mel Gorman Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130515132453.GB11497@suse.de> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <5191B5B3.7080406@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tang Chen Cc: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, 
linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski

On Tue, May 14, 2013 at 11:55:31AM +0800, Tang Chen wrote:
> Hi Mel,
>
> On 05/13/2013 05:19 PM, Mel Gorman wrote:
> >>For memory hot-remove case, the aio pages are pinned in memory and making
> >>the pages cannot be offlined, furthermore, the pages cannot be removed.
> >>
> >>IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio
> >>subsystem, and call them when hot-remove code tries to offline
> >>pages, right ?
> >>
> >>If so, I'm wondering where should we put this callback pointers ?
> >>In struct page ?
> >>
> >
> >No, I would expect the callbacks to be part the address space operations
> >which can be found via page->mapping.
> >
>
> Two more problems I don't quite understand.
>

Bear in mind I've done no research on this particular problem. At best, the migrate pin/unpin is the direction that I'd start with if I was tasked with fixing this (which I'm not). Hence, I cannot answer your questions at the level of detail you are looking for.

> 1. For an anonymous page, it has no address_space, and no address space
> operation. But the aio ring problem just happened when dealing with
> anonymous pages. Please refer to:
> (https://lkml.org/lkml/2012/11/29/69)
>

If it is to be an address space operations structure then you'll need a pseudo mapping structure for anonymous pages that are pinned by aio -- similar in principle to how swapper_space is used for managing PageSwapCache or how anon_vma structures can be associated with a page.

However, I warn you that you may find that the address_space is the wrong level to register such callbacks; it just seemed like the obvious first choice. A potential alternative implementation is to create a 1:1 association between pages and a long-lived holder that is stored on a hash table (similar style of arrangement as page_waitqueue).
A page is looked up in the hash table and, if an entry exists, it points to a callback structure for the subsystem holding the pin. It's up to the subsystem to register the callbacks when it is about to pin a page (get_user_pages_longlived(...., &release_ops)) and figure out how to release the pin safely.

> 2. How to find out the reason why page->count != 1 in
> migrate_page_move_mapping() ?
>
> In the problem we are dealing with, get_user_pages() is called to
> pin the pages
> in memory. And the pages are migratable. So we want to decrease
> the page->count.
>
> But get_user_pages() is not the only reason leading to
> page->count increased.
> How can I know when should decrease the page->count or when should not ?
>

You cannot just arbitrarily drop the page->count without causing problems. It has to be released by the subsystem holding the pin because only it can know when it's safe.

--
Mel Gorman
SUSE Labs
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
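The hash-table arrangement described in the message above can be sketched in userspace C. This is only an illustration under stated assumptions: pin_register(), pin_unregister(), pin_try_release(), struct pin_release_ops, and the bucket count are hypothetical names invented for this sketch, not existing kernel interfaces. A real implementation would key on struct page rather than a bare number and would need locking around the table.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PIN_HASH_SIZE 64u	/* hypothetical bucket count, power of two */

/* Callbacks supplied by the subsystem that holds a long-lived pin. */
struct pin_release_ops {
	/* Ask the holder to drop its pin on pfn; returns 0 on success. */
	int (*release)(uintptr_t pfn, void *owner);
};

struct pin_entry {
	uintptr_t pfn;
	const struct pin_release_ops *ops;
	void *owner;		/* opaque holder context */
	struct pin_entry *next;	/* bucket chain */
};

static struct pin_entry *pin_table[PIN_HASH_SIZE];

static unsigned int pin_hash(uintptr_t pfn)
{
	return (unsigned int)(pfn ^ (pfn >> 6)) & (PIN_HASH_SIZE - 1);
}

/* Called by the holder just before it takes a long-lived pin. */
static int pin_register(uintptr_t pfn, const struct pin_release_ops *ops,
			void *owner)
{
	struct pin_entry *e;
	unsigned int h = pin_hash(pfn);

	assert(ops && ops->release);
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->pfn = pfn;
	e->ops = ops;
	e->owner = owner;
	e->next = pin_table[h];
	pin_table[h] = e;
	return 0;
}

/* Called by the holder once the pin is gone for good. */
static void pin_unregister(uintptr_t pfn)
{
	struct pin_entry **pp = &pin_table[pin_hash(pfn)];

	for (; *pp; pp = &(*pp)->next) {
		if ((*pp)->pfn == pfn) {
			struct pin_entry *e = *pp;

			*pp = e->next;
			free(e);
			return;
		}
	}
}

/*
 * Called by the migration path. Returns -1 if nobody registered for pfn,
 * otherwise whatever the holder's release callback returns.
 */
static int pin_try_release(uintptr_t pfn)
{
	struct pin_entry *e;

	for (e = pin_table[pin_hash(pfn)]; e; e = e->next)
		if (e->pfn == pfn)
			return e->ops->release(pfn, e->owner);
	return -1;
}

/* Toy holder: only counts how often it was asked to let go. */
static int demo_release(uintptr_t pfn, void *owner)
{
	(void)pfn;
	++*(int *)owner;
	return 0;
}

static const struct pin_release_ops demo_ops = { .release = demo_release };
```

In this sketch the migration path never touches the reference count itself. It only asks the registered holder to let go, which matches the point that only the subsystem holding the pin knows when releasing it is safe.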
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <5194748A.5070700@cn.fujitsu.com> Date: Thu, 16 May 2013 13:54:18 +0800 From: Tang Chen MIME-Version: 1.0 Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> In-Reply-To: <20130515132453.GB11497@suse.de> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15; format=flowed Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Hi Mel, On 05/15/2013 09:24 PM, Mel Gorman wrote: > If it is to be an address space operations sturcture then you'll need a > pseudo mapping structure for anonymous pages that are pinned by aio -- > similar in principal to how swapper_space is used for managing PageSwapCache > or how anon_vma structures can be associated with a page. > > However, I warn you that you may find that the address_space is the > wrong level to register such callbacks, it just seemed like the obvious > first choice. 
A potential alternative implementation is to create a 1:1
> association between pages and a long-lived holder that is stored on a hash
> table (similar style of arrangement as page_waitqueue). A page is looked up
> in the hash table and if an entry exists, it points to a callback structure
> to the subsystem holding the pin. It's up to the subsystem to register the
> callbacks when it is about to pin a page (get_user_pages_longlived(....,
> &release_ops)) and figure out how to release the pin safely.
>

OK, I'll try to figure out a proper place to put the callbacks. But I think we need to add something new to struct page. I'm just not sure if it is OK. Maybe we can discuss it further when I send an RFC patch.

Thanks for the advice; I'll try these approaches.

Thanks. :)

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Thu, 16 May 2013 20:23:49 -0400
From: Benjamin LaHaise
Subject: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())
Message-ID: <20130517002349.GI1008@kvack.org>
References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5194748A.5070700@cn.fujitsu.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tang Chen
Cc: Mel Gorman, Minchan Kim, Lin Feng, akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org,
walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski

On Thu, May 16, 2013 at 01:54:18PM +0800, Tang Chen wrote:
...
> OK, I'll try to figure out a proper place to put the callbacks.
> But I think we need to add something new to struct page. I'm just
> not sure if it is OK. Maybe we can discuss more about it when I send
> an RFC patch.
...

I ended up working on this a bit today, and managed to cobble together something that somewhat works -- please see the patch below. It still is not completely tested, and it has a rather nasty bug owing to the fact that the file descriptors returned by anon_inode_getfile() all share the same inode (read: more than one instance of aio does not work), but it shows the basic idea. Also, bad things probably happen if someone does an mremap() on the aio ring buffer. I'll polish this off sometime next week after the long weekend if no one beats me to it.

-ben
--
"Thought is the essence of where you are now."
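The central step of the patch that follows, copying the old ring page into the new one and switching the ring_pages[] slot while holding completion_lock, can be tried out in a userspace C sketch. Everything named here is an assumption made for illustration: struct ring_ctx, ring_store(), ring_load(), ring_migrate(), and the sizes are hypothetical stand-ins, and a C11 atomic_flag plays the role of the kernel spinlock. The sketch leaves out page-table updates and reference counting entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define RING_PAGES 4		/* hypothetical number of ring pages */
#define RING_PAGE_SIZE 4096	/* hypothetical page size */

/* Hypothetical stand-in for the parts of struct kioctx used here. */
struct ring_ctx {
	atomic_flag completion_lock;	/* toy spinlock */
	unsigned char *ring_pages[RING_PAGES];
};

static void ring_lock(struct ring_ctx *ctx)
{
	while (atomic_flag_test_and_set_explicit(&ctx->completion_lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void ring_unlock(struct ring_ctx *ctx)
{
	atomic_flag_clear_explicit(&ctx->completion_lock,
				   memory_order_release);
}

static int ring_init(struct ring_ctx *ctx)
{
	int i;

	atomic_flag_clear(&ctx->completion_lock);
	for (i = 0; i < RING_PAGES; i++) {
		ctx->ring_pages[i] = calloc(1, RING_PAGE_SIZE);
		if (!ctx->ring_pages[i]) {
			while (i--)
				free(ctx->ring_pages[i]);
			return -1;
		}
	}
	return 0;
}

/* Writer side, playing the role of aio_complete() storing an event. */
static void ring_store(struct ring_ctx *ctx, unsigned int idx, size_t off,
		       unsigned char val)
{
	assert(idx < RING_PAGES && off < RING_PAGE_SIZE);
	ring_lock(ctx);
	ctx->ring_pages[idx][off] = val;
	ring_unlock(ctx);
}

static unsigned char ring_load(struct ring_ctx *ctx, unsigned int idx,
			       size_t off)
{
	unsigned char val;

	assert(idx < RING_PAGES && off < RING_PAGE_SIZE);
	ring_lock(ctx);
	val = ctx->ring_pages[idx][off];
	ring_unlock(ctx);
	return val;
}

/* Migration side: copy the data and switch the slot under the same lock. */
static int ring_migrate(struct ring_ctx *ctx, unsigned int idx)
{
	unsigned char *newp, *oldp;

	assert(idx < RING_PAGES);
	newp = malloc(RING_PAGE_SIZE);
	if (!newp)
		return -1;

	ring_lock(ctx);
	oldp = ctx->ring_pages[idx];
	memcpy(newp, oldp, RING_PAGE_SIZE);
	ctx->ring_pages[idx] = newp;
	ring_unlock(ctx);

	free(oldp);	/* no writer can reach the old page any more */
	return 0;
}
```

Because writers and the migration step take the same lock, a writer sees either the old page before the copy or the new page after the switch, never a page whose contents are about to be discarded.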

 fs/aio.c                |   95 ++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/migrate.h |    3 +
 mm/migrate.c            |    2 -
 3 files changed, 96 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index c5b1a8c..dbad23e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -35,6 +35,9 @@
 #include
 #include
 #include
+#include
+#include
+#include
 
 #include
 #include
@@ -108,6 +111,7 @@ struct kioctx {
 	} ____cacheline_aligned_in_smp;
 
 	struct page		*internal_pages[AIO_RING_PAGES];
+	struct file		*ctx_file;
 };
 
 /*------ sysctl variables----*/
@@ -146,8 +150,59 @@ static void aio_free_ring(struct kioctx *ctx)
 
 	if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
 		kfree(ctx->ring_pages);
+
+	if (ctx->ctx_file) {
+		truncate_setsize(ctx->ctx_file->f_inode, 0);
+		fput(ctx->ctx_file);
+		ctx->ctx_file = NULL;
+	}
+}
+
+static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	vma->vm_ops = &generic_file_vm_ops;
+	return 0;
+}
+
+static const struct file_operations aio_ctx_fops = {
+	.mmap	= aio_ctx_mmap,
+};
+
+static int aio_set_page_dirty(struct page *page)
+{
+	return 0;
+}
+
+static int aio_migratepage(struct address_space *mapping, struct page *new,
+			   struct page *old, enum migrate_mode mode)
+{
+	struct kioctx *ctx = mapping->private_data;
+	unsigned long flags;
+	unsigned idx = old->index;
+	int rc;
+
+	BUG_ON(PageWriteback(old));    /* Writeback must be complete */
+	put_page(old);
+	rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
+	if (rc != MIGRATEPAGE_SUCCESS) {
+		get_page(old);
+		return rc;
+	}
+	get_page(new);
+
+	spin_lock_irqsave(&ctx->completion_lock, flags);
+	migrate_page_copy(new, old);
+	ctx->ring_pages[idx] = new;
+	spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+	return MIGRATEPAGE_SUCCESS;
 }
 
+static const struct address_space_operations aio_ctx_aops = {
+	.set_page_dirty	= aio_set_page_dirty,
+	.migratepage	= aio_migratepage,
+};
+
 static int aio_setup_ring(struct kioctx *ctx)
 {
 	struct aio_ring *ring;
@@ -155,6 +210,7 @@ static int aio_setup_ring(struct kioctx *ctx)
 	struct mm_struct *mm = current->mm;
 	unsigned long size, populate;
 	int nr_pages;
+	int i;
 
 	/* Compensate for the ring buffer's head/tail overlap entry */
 	nr_events += 2;	/* 1 is required, 2 for good luck */
@@ -166,6 +222,31 @@ static int aio_setup_ring(struct kioctx *ctx)
 	if (nr_pages < 0)
 		return -EINVAL;
 
+	ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR);
+	if (IS_ERR(ctx->ctx_file)) {
+		ctx->ctx_file = NULL;
+		return -EAGAIN;
+	}
+	ctx->ctx_file->f_inode->i_mapping->a_ops = &aio_ctx_aops;
+	ctx->ctx_file->f_inode->i_mapping->private_data = ctx;
+	ctx->ctx_file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages;
+
+	for (i=0; i<nr_pages; i++) {
+		struct page *page;
+		void *ptr;
+		page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping,
+					   i, GFP_KERNEL);
+		if (!page) {
+			break;
+		}
+		ptr = kmap(page);
+		clear_page(ptr);
+		kunmap(page);
+		SetPageUptodate(page);
+		SetPageDirty(page);
+		unlock_page(page);
+	}
+
 	nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);
 
 	ctx->nr_events = 0;
@@ -180,20 +261,25 @@ static int aio_setup_ring(struct kioctx *ctx)
 	ctx->mmap_size = nr_pages * PAGE_SIZE;
 	pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
 	down_write(&mm->mmap_sem);
-	ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size,
+	ctx->mmap_base = do_mmap_pgoff(ctx->ctx_file, 0, ctx->mmap_size,
 				       PROT_READ|PROT_WRITE,
-				       MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate);
+				       MAP_SHARED|MAP_POPULATE, 0,
+				       &populate);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		up_write(&mm->mmap_sem);
 		ctx->mmap_size = 0;
 		aio_free_ring(ctx);
 		return -EAGAIN;
 	}
+	up_write(&mm->mmap_sem);
+	mm_populate(ctx->mmap_base, populate);
 
 	pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
 	ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
 				       1, 0, ctx->ring_pages, NULL);
-	up_write(&mm->mmap_sem);
+	for (i=0; i<ctx->nr_pages; i++) {
+		put_page(ctx->ring_pages[i]);
+	}
 
 	if (unlikely(ctx->nr_pages != nr_pages)) {
 		aio_free_ring(ctx);
@@ -403,6 +489,8 @@ out_cleanup:
 	err = -EAGAIN;
 	aio_free_ring(ctx);
 out_freectx:
+	if (ctx->ctx_file)
+		fput(ctx->ctx_file);
 	kmem_cache_free(kioctx_cachep, ctx);
 	pr_debug("error allocating ioctx %d\n", err);
 	return ERR_PTR(err);
@@ -852,6 +940,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
 	ioctx = ioctx_alloc(nr_events);
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
+		ctx = ioctx->user_id;
 		ret = put_user(ioctx->user_id, ctxp);
 		if (ret)
 			kill_ioctx(ioctx);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..b6f3289 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -55,6 +55,9 @@ extern int migrate_vmas(struct mm_struct *mm,
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
+extern int migrate_page_move_mapping(struct address_space *mapping,
+		struct page *newpage, struct page *page,
+		struct buffer_head *head, enum migrate_mode mode);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
diff --git a/mm/migrate.c b/mm/migrate.c
index 27ed225..ac9c3a9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -294,7 +294,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
  * 2 for pages with a mapping
  * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
  */
-static int migrate_page_move_mapping(struct address_space *mapping,
+int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode)
 {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
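
For readers following the patch above: its core idea is that the migration code reaches the owner of a page through page->mapping->a_ops->migratepage, so the owner can update its own page table when a page moves. The following is a minimal userspace sketch of that dispatch, not kernel code. The struct definitions are simplified stand-ins, and struct ring_ctx, ring_migratepage(), ring_aops and migrate_one_page() are hypothetical names made up for illustration. Locking and reference counting, which the real hook has to get right, are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace stand-ins for kernel types. Field names follow the patch, but
 * these are simplified models, not the real kernel definitions.
 */
struct page;
struct address_space;

struct address_space_operations {
	int (*migratepage)(struct address_space *mapping,
			   struct page *newpage, struct page *oldpage);
};

struct address_space {
	const struct address_space_operations *a_ops;
	void *private_data;	/* owner context, as set up in the patch */
};

struct page {
	struct address_space *mapping;	/* pointer struct page already has */
	unsigned long index;		/* offset of the page in the mapping */
	char data[16];			/* stand-in for the page contents */
};

#define RING_PAGES 2

/* Hypothetical miniature of struct kioctx: only the ring page table. */
struct ring_ctx {
	struct page *ring_pages[RING_PAGES];
};

/* Owner-specific hook: copy the contents and repoint the owner's table. */
static int ring_migratepage(struct address_space *mapping,
			    struct page *newpage, struct page *oldpage)
{
	struct ring_ctx *ctx = mapping->private_data;

	assert(oldpage->index < RING_PAGES);
	memcpy(newpage->data, oldpage->data, sizeof(newpage->data));
	newpage->mapping = mapping;
	newpage->index = oldpage->index;
	oldpage->mapping = NULL;
	ctx->ring_pages[newpage->index] = newpage;
	return 0;
}

const struct address_space_operations ring_aops = {
	.migratepage = ring_migratepage,
};

/* Generic side: finds the hook through the page, knows nothing about rings. */
int migrate_one_page(struct page *newpage, struct page *oldpage)
{
	struct address_space *mapping = oldpage->mapping;

	if (!mapping || !mapping->a_ops || !mapping->a_ops->migratepage)
		return -1;	/* no hook: this sketch treats the page as unmovable */
	return mapping->a_ops->migratepage(mapping, newpage, oldpage);
}
```

Calling migrate_one_page(&new_page, &old_page) on a page whose mapping uses ring_aops copies the contents and repoints ring_pages[]; a page with no mapping returns -1 in this sketch, since there is no address_space to carry a hook for it.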
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Message-ID: <5195A3F4.70803@cn.fujitsu.com>
Date: Fri, 17 May 2013 11:28:52 +0800
From: Tang Chen
MIME-Version: 1.0
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())
References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org>
In-Reply-To: <20130517002349.GI1008@kvack.org>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Sender: owner-linux-mm@kvack.org
List-ID:
To: Benjamin LaHaise
Cc: Mel Gorman, Minchan Kim, Lin Feng, akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski

Hi Benjamin,

Thank you very much for your idea. :)

I have no objection to your idea, but judging from your patch, it only
works for the aio subsystem, because you changed the way the aio ring
pages are allocated to use a file mapping.

As far as I know, not only aio but also other subsystems, such as CMA,
have a problem like this: a page cannot be migrated because it is pinned
in memory. So I think we should work out a common way to migrate pinned
pages.

I'm working along the lines Mel suggested, with migrate_unpin() and
migrate_pin() callbacks. But as you saw, I ran into some problems; for
example, I don't know where to put these two callbacks. After discussing
it with you all, I want to try this:

1. Add a new member to struct page, used to remember the pin holders of
   this page, including the pin and unpin callbacks and the necessary data.
   This is more like a callback chain.
   (I'm worried about this step; I'm not sure it is good enough. After all,
   we need a good place to put the callbacks.)

And then, like Mel said,

2. Implement the callbacks in the subsystems, and register them in the
   new member in struct page.

3. Call these callbacks before and after migration.

I think I'll send an RFC patch next week when I have finished the outline.
I'm just trying to find a common way to solve this problem so that all the
other subsystems will benefit.

Thanks. :)

On 05/17/2013 08:23 AM, Benjamin LaHaise wrote:
> On Thu, May 16, 2013 at 01:54:18PM +0800, Tang Chen wrote:
> ...
>> OK, I'll try to figure out a proper place to put the callbacks.
>> But I think we need to add something new to struct page. I'm just
>> not sure if it is OK. Maybe we can discuss more about it when I send
>> a RFC patch.
> ...
>
> I ended up working on this a bit today, and managed to cobble together
> something that somewhat works -- please see the patch below. It still is
> not completely tested, and it has a rather nasty bug owing to the fact
> that the file descriptors returned by anon_inode_getfile() all share the
> same inode (read: more than one instance of aio does not work), but it
> shows the basic idea. Also, bad things probably happen if someone does
> an mremap() on the aio ring buffer. I'll polish this off sometime next
> week after the long weekend if no one beats me to it.
>
> 		-ben

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
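
Step 1 of the proposal above stores callbacks in struct page, so its memory cost scales with the number of pages in the machine rather than with the number of pinned pages. A small sketch of that arithmetic follows. It assumes 4 KiB pages and 8-byte function pointers (a typical 64-bit system); struct pin_callbacks, page_meta_overhead() and print_overhead() are hypothetical names, the callback signatures are made up, and the real struct page layout is not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical callback pair. The names follow the migrate_pin() and
 * migrate_unpin() idea from the thread; the signatures are invented here.
 */
struct pin_callbacks {
	void (*migrate_unpin)(void *data);
	void (*migrate_pin)(void *data);
};

/* Extra bytes of per-page metadata for a machine with ram_bytes of memory. */
uint64_t page_meta_overhead(uint64_t ram_bytes, uint64_t page_size,
			    uint64_t extra_per_page)
{
	return (ram_bytes / page_size) * extra_per_page;
}

/* Print the overhead for ram_gib GiB of RAM, assuming 4 KiB pages. */
void print_overhead(uint64_t ram_gib)
{
	uint64_t bytes = page_meta_overhead(ram_gib << 30, 4096,
					    sizeof(struct pin_callbacks));

	printf("%llu GiB RAM: %llu MiB of extra struct page data\n",
	       (unsigned long long)ram_gib,
	       (unsigned long long)(bytes >> 20));
}
```

Under those assumptions, two function pointers add 16 bytes per page, which works out to 4 MiB per GiB of RAM, or 256 MiB on a 64 GiB machine, whether or not any page is ever pinned.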
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 17 May 2013 10:37:18 -0400
From: Benjamin LaHaise
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())
Message-ID: <20130517143718.GK1008@kvack.org>
References: <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5195A3F4.70803@cn.fujitsu.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tang Chen
Cc: Mel Gorman, Minchan Kim, Lin Feng, akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski

On Fri, May 17, 2013 at 11:28:52AM +0800, Tang Chen wrote:
> Hi Benjamin,
>
> Thank you very much for your idea. :)
>
> I have no objection to your idea, but seeing from your patch, this only
> works for aio subsystem because you changed the way to allocate the aio
> ring pages, with a file mapping.

That is correct. There is no way you're going to be able to solve this
problem without dealing with the issue on a subsystem by subsystem basis.

> So far as I know, not only aio, but also other subsystems, such CMA, will
> also have problem like this. The page cannot be migrated because it is
> pinned in memory. So I think we should work out a common way to solve how
> to migrate pinned pages.

A generic approach would require hardware support, but I doubt that is
going to happen.

> I'm working in the way Mel has said, migrate_unpin() and migrate_pin()
> callbacks. But as you saw, I met some problems, like I don't where to put
> these two callbacks. And discussed with you guys, I want to try this:
>
> 1. Add a new member to struct page, used to remember the pin holders of
> this page, including the pin and unpin callbacks and the necessary data.
> This is more like a callback chain.
> (I'm worry about this step, I'm not sure if it is good enough. After all,
> we need a good place to put the callbacks.)

Putting function pointers into struct page is not going to happen. You'd
be adding a significant amount of memory overhead for something that is
never going to be used on the vast majority of systems (2 function pointers
would be 16 bytes per page on a 64 bit system). Keep in mind that distro
kernels tend to enable almost all config options on their kernels, so the
overhead of any approach has to make sense for the users of the kernel that
will never make use of this kind of migration.

> And then, like Mel said,
>
> 2. Implement the callbacks in the subsystems, and register them to the
> new member in struct page.

No, the hook should be in the address_space_operations. We already have
a pointer to an address space in struct page. This avoids adding more
overhead to struct page.

> 3. Call these callbacks before and after migration.

How is that better than using the existing hook in address_space_operations?

> I think I'll send a RFC patch next week when I finished the outline. I'm
> just thinking of finding a common way to solve this problem that all the
> other subsystems will benefit.

Before pursuing this approach, make sure you've got buy-in for all of the
overhead you're adding to the system. I don't think that growing struct
page is going to be an acceptable design choice given the amount of
overhead it will incur.

> Thanks. :)

Cheers,

		-ben
--
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Fri, 17 May 2013 14:30:03 -0400
From: Benjamin LaHaise
Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable())
Message-ID: <20130517183003.GL1008@kvack.org>
References: <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <20130517181708.GG318@lenny.home.zabbo.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130517181708.GG318@lenny.home.zabbo.net>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Zach Brown
Cc: Tang Chen, Mel Gorman, Minchan Kim, Lin Feng, akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski

On Fri, May 17, 2013 at 11:17:08AM -0700, Zach Brown wrote:
> > I ended up working on this a bit today, and managed to cobble together
> > something that somewhat works -- please see the patch below.
>
> Just some quick observations:
>
> > +	ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR);
> > +	if (IS_ERR(ctx->ctx_file)) {
> > +		ctx->ctx_file = NULL;
> > +		return -EAGAIN;
> > +	}
>
> It's too bad that aio contexts will now be accounted against the filp
> limits (get_empty_filp -> files_stat.max_files, etc).

Yeah, that is a downside of this approach. It would be possible to do
it with only an inode/address_space, but that would mean bypassing
do_mmap(), which is not worth considering. If it is really an issue, we
could add a flag to bypass that limit since aio has its own.
anon_inode_getfile() as it stands is a major problem.

> > +	for (i=0; i<nr_pages; i++) {
> > +		struct page *page;
> > +		void *ptr;
> > +		page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping,
> > +					   i, GFP_KERNEL);
> > +		if (!page) {
> > +			break;
> > +		}
> > +		ptr = kmap(page);
> > +		clear_page(ptr);
> > +		kunmap(page);
> > +		SetPageUptodate(page);
> > +		SetPageDirty(page);
> > +		unlock_page(page);
> > +	}
>
> If they're GFP_KERNEL then you don't need to kmap them. But we probably
> want to allocate with GFP_HIGHUSER and then use clear_user_highpage() to
> zero them?

Adding __GFP_ZERO would fix that too. The next respin will include that
change. I also have to properly handle the mremap() case as well.

		-ben
--
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 20 May 2013 22:27:33 -0400 From: Benjamin LaHaise Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Message-ID: <20130521022733.GT1008@kvack.org> References: <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <519AD6F8.2070504@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tang Chen Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski On Tue, May 21, 2013 at 10:07:52AM +0800, Tang Chen wrote: .... > I'm not saying using two callbacks before and after migration is better. > I don't want to use address_space_operations is because there is no such > member > for anonymous pages. That depends on the nature of the pinning. For the general case of get_user_pages(), you're correct that it won't work for anonymous memory. > In your idea, using a file mapping will create a > address_space_operations. But > I really don't think we can modify the way of memory allocation for all the > subsystems who has this problem. Maybe not just aio and cma. 
That means if > you want to pin pages in memory, you have to use a file mapping. This makes > the memory allocation more complicated. And the idea should be known by all > the subsystem developers. Is that going to happen ? Different subsystems will need to use different approaches to fixing the issue. I doubt any single approach will work for everything. > I also thought about reuse one field of struct page. But as you said, there > may not be many users of this functionality. Reusing a field of struct page > will make things more complicated and lead to high coupling. What happens when more than one subsystem tries to pin a particular page? What if it's a shared page rather than an anonymous page? > So, how about the other idea that Mel mentioned ? > > We create a 1-1 mapping of pinned page ranges and the pinner (subsystem > callbacks and data), maybe a global list or a hash table. And then, we can > find the callbacks. Maybe that is the simplest approach, but it's going to make get_user_pages() slower and more complicated (as if it wasn't already). Maybe with all the bells and whistles of per-cpu data structures and such you can make it work, but I'm pretty sure someone running the large unmentionable benchmark will complain about the performance regressions you're going to introduce. At least in the case of the AIO ring buffer, using the address_space approach doesn't introduce any new performance issues. There's also the bigger question of if you can or cannot exclude get_user_pages_fast() from this. In short: you've got a lot more work on your hands to do. > Thanks. :) Cheers, -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <51B6F107.80501@cn.fujitsu.com> Date: Tue, 11 Jun 2013 17:42:31 +0800 From: Tang Chen MIME-Version: 1.0 Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> In-Reply-To: <20130521022733.GT1008@kvack.org> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1; format=flowed Sender: owner-linux-mm@kvack.org List-ID: To: Benjamin LaHaise Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Hi Benjamin, Are you still working on this problem ? Thanks. :) On 05/21/2013 10:27 AM, Benjamin LaHaise wrote: > On Tue, May 21, 2013 at 10:07:52AM +0800, Tang Chen wrote: > .... >> I'm not saying using two callbacks before and after migration is better. >> I don't want to use address_space_operations is because there is no such >> member >> for anonymous pages. > > That depends on the nature of the pinning. For the general case of > get_user_pages(), you're correct that it won't work for anonymous memory. 
> >> In your idea, using a file mapping will create a >> address_space_operations. But >> I really don't think we can modify the way of memory allocation for all the >> subsystems who has this problem. Maybe not just aio and cma. That means if >> you want to pin pages in memory, you have to use a file mapping. This makes >> the memory allocation more complicated. And the idea should be known by all >> the subsystem developers. Is that going to happen ? > > Different subsystems will need to use different approaches to fixing the > issue. I doubt any single approach will work for everything. > >> I also thought about reuse one field of struct page. But as you said, there >> may not be many users of this functionality. Reusing a field of struct page >> will make things more complicated and lead to high coupling. > > What happens when more than one subsystem tries to pin a particular page? > What if it's a shared page rather than an anonymous page? > >> So, how about the other idea that Mel mentioned ? >> >> We create a 1-1 mapping of pinned page ranges and the pinner (subsystem >> callbacks and data), maybe a global list or a hash table. And then, we can >> find the callbacks. > > Maybe that is the simplest approach, but it's going to make get_user_pages() > slower and more complicated (as if it wasn't already). Maybe with all the > bells and whistles of per-cpu data structures and such you can make it work, > but I'm pretty sure someone running the large unmentionable benchmark will > complain about the performance regressions you're going to introduce. At > least in the case of the AIO ring buffer, using the address_space approach > doesn't introduce any new performance issues. There's also the bigger > question of if you can or cannot exclude get_user_pages_fast() from this. > In short: you've got a lot more work on your hands to do. > >> Thanks. 
:) > > Cheers, > > -ben -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 2 Jul 2013 14:00:08 -0400 From: Benjamin LaHaise Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Message-ID: <20130702180008.GQ16399@kvack.org> References: <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51D12E7B.6080301@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Gu Zheng Cc: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote: > Hi Ben, > Are you still working on this patch? > As you know, using the current anon inode will lead to more than one instance of > aio can not work. Have you found a way to fix this issue? Or can we use some > other ones to replace the anon inode? This patch hasn't been a high priority for me. 
I would really appreciate it if someone could confirm that this patch does indeed fix the hotplug page migration issue by testing it in a system that hits the bug. Removing the anon_inode bits isn't too much work, but I'd just like to have some confirmation that this fix is considered to be "good enough" for the problem at hand before spending any further time on it. There was talk of using another approach, but it's not clear if there was any progress. -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <51D3841D.9040906@cn.fujitsu.com> Date: Wed, 03 Jul 2013 09:53:33 +0800 From: Gu Zheng MIME-Version: 1.0 Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org> In-Reply-To: <20130702180008.GQ16399@kvack.org> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Benjamin LaHaise Cc: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, 
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski On 07/03/2013 02:00 AM, Benjamin LaHaise wrote: > On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote: >> Hi Ben, >> Are you still working on this patch? >> As you know, using the current anon inode will lead to more than one instance of >> aio can not work. Have you found a way to fix this issue? Or can we use some >> other ones to replace the anon inode? > > This patch hasn't been a high priority for me. I would really appreciate > it if someone could confirm that this patch does indeed fix the hotplug > page migration issue by testing it in a system that hits the bug. Removing > the anon_inode bits isn't too much work, but I'd just like to have some > confirmation that this fix is considered to be "good enough" for the > problem at hand before spending any further time on it. There was talk of > using another approach, but it's not clear if there was any progress. Yeah, we have not seen anyone try to fix this issue using the other approach we talked. I'm not sure whether your patch can indeed fix the problem, but I'll carry out a complete test to confirm it, and I'll be very glad to continue this job based on your patch if you do not have enough time working on it.:) Thanks, Gu > > -ben -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <51D51B66.3000301@cn.fujitsu.com> Date: Thu, 04 Jul 2013 14:51:18 +0800 From: Gu Zheng MIME-Version: 1.0 Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org> In-Reply-To: <20130702180008.GQ16399@kvack.org> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Benjamin LaHaise Cc: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski On 07/03/2013 02:00 AM, Benjamin LaHaise wrote: > On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote: >> Hi Ben, >> Are you still working on this patch? >> As you know, using the current anon inode will lead to more than one instance of >> aio can not work. Have you found a way to fix this issue? Or can we use some >> other ones to replace the anon inode? > > This patch hasn't been a high priority for me. I would really appreciate > it if someone could confirm that this patch does indeed fix the hotplug > page migration issue by testing it in a system that hits the bug. 
Removing > the anon_inode bits isn't too much work, but I'd just like to have some > confirmation that this fix is considered to be "good enough" for the > problem at hand before spending any further time on it. There was talk of > using another approach, but it's not clear if there was any progress. Hi Ben, When I test your patch on kernel 3.10, the kernel panic when aio job complete or exit, exactly in aio_free_ring(), the following is a part of dmesg. Thanks, Gu kernel BUG at mm/swap.c:163! invalid opcode: 0000 [#1] SMP Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF 3.10.0-aio-migrate+ #107 Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 Workqueue: events kill_ioctx_work task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 RIP: 0010:[] [] put_page+0x48/0x60 RSP: 0018:ffff8807dda99cd8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8807be1f1e00 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea001b196c80 RBP: ffff8807dda99cd8 R08: 0000000000000000 R09: 0000000000000000 R10: ffff8807ffbb5f00 R11: 000000000000005a R12: 0000000000000001 R13: 0000000000000000 R14: ffff8807dda974e0 R15: ffff8807be1f1ec8 FS: 0000000000000000(0000) 
GS:ffff8807fd680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003b826dc7d0 CR3: 0000000001a0b000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8807dda99d18 ffffffff811b11f6 0000000000000000 0000000200000000 ffff8807be1f1e00 ffff8807be1f1e80 000000000000000c 0000000000000000 ffff8807dda99dc8 ffffffff811b21a2 00000001000438ec ffff8807fd692d00 Call Trace: [] aio_free_ring+0x96/0x1c0 [] free_ioctx+0x1f2/0x250 [] ? idle_balance+0xed/0x140 [] put_ioctx+0x1a/0x30 [] kill_ioctx_work+0x2f/0x40 [] process_one_work+0x183/0x490 [] worker_thread+0x120/0x3a0 [] ? manage_workers+0x160/0x160 [] kthread+0xce/0xe0 [] ? kthread_freezable_should_stop+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? kthread_freezable_should_stop+0x70/0x70 Code: 07 00 c0 75 1f f0 ff 4f 1c 0f 94 c0 84 c0 75 0b c9 66 90 c3 0f 1f 80 00 00 00 00 e8 53 fe ff ff c9 66 90 c3 e8 7a fe ff ff c9 c3 <0f> 0b 66 0f 1f 44 00 00 eb f8 48 8b 47 30 eb bc 0f 1f 84 00 00 RIP [] put_page+0x48/0x60 RSP ---[ end trace b5e2c17407c840d8 ]--- Jul 4 15:49:50 BUG: unable to handle kernel paging request at ffffffffffffffd8 IP: [] kthread_data+0x10/0x20 PGD 1a0c067 PUD 1a0e067 PMD 0 Oops: 0000 [#2] SMP Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) 
megaraid_sas(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF D 3.10.0-aio-migrate+ #107 Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 RIP: 0010:[] [] kthread_data+0x10/0x20 RSP: 0018:ffff8807dda999b8 EFLAGS: 00010092 RAX: 0000000000000000 RBX: 0000000000000004 RCX: ffffffff81da3ea0 RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8807dda974e0 RBP: ffff8807dda999b8 R08: ffff8807dda97550 R09: 0000000000000006 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004 R13: ffff8807dda97ab8 R14: 0000000000000001 R15: 0000000000000006 FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000001a0b000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8807dda999d8 ffffffff8105e155 ffff8807dda999d8 ffff8807fd692d00 ffff8807dda99a68 ffffffff8154168b ffff8807dda99fd8 0000000000012d00 ffff8807dda98010 0000000000012d00 0000000000012d00 0000000000012d00 Call Trace: [] wq_worker_sleeping+0x15/0xa0 [] __schedule+0x5ab/0x6f0 [] ? put_io_context_active+0xc2/0xf0 [] schedule+0x29/0x70 [] do_exit+0x2d5/0x480 [] oops_end+0xa9/0xf0 [] die+0x5b/0x90 [] do_trap+0xcb/0x170 [] ? __atomic_notifier_call_chain+0x12/0x20 [] do_invalid_op+0x95/0xb0 [] ? put_page+0x48/0x60 [] ? truncate_inode_pages_range+0x201/0x500 [] invalid_op+0x18/0x20 [] ? put_page+0x48/0x60 [] ? truncate_setsize+0x19/0x20 [] aio_free_ring+0x96/0x1c0 [] free_ioctx+0x1f2/0x250 [] ? idle_balance+0xed/0x140 [] put_ioctx+0x1a/0x30 [] kill_ioctx_work+0x2f/0x40 [] process_one_work+0x183/0x490 [] worker_thread+0x120/0x3a0 [] ? manage_workers+0x160/0x160 [] kthread+0xce/0xe0 [] ? 
kthread_freezable_should_stop+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? kthread_freezable_should_stop+0x70/0x70 Code: 80 05 00 00 48 8b 40 c8 c9 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 80 05 00 00 <48> 8b 40 d8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 RIP [] kthread_data+0x10/0x20 RSP CR2: ffffffffffffffd8 ---[ end trace b5e2c17407c840d9 ]--- DP kernel: -----Fixing recursive fault but reboot is needed! -------[ cut here ]------------ Jul 4 15:49:50 DP kernel: kernel BUG at mm/swap.c:163! Jul 4 15:49:50 DP kernel: invalid opcode: 0000 [#1] SMP Jul 4 15:49:50 DP kernel: Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) Jul 4 15:49:50 DP kernel: CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF 3.10.0-aio-migrate+ #107 Jul 4 15:49:50 DP kernel: Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 Jul 4 15:49:50 DP kernel: Workqueue: events kill_ioctx_work Jul 4 15:49:50 DP kernel: task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 Jul 4 15:49:50 DP kernel: RIP: 0010:[] [] put_page+0x48/0x60 Jul 4 15:49:50 DP kernel: RSP: 0018:ffff8807dda99cd8 EFLAGS: 00010246 Jul 4 15:49:50 DP kernel: RAX: 0000000000000000 RBX: ffff8807be1f1e00 RCX: 0000000000000001 
Jul 4 15:49:50 DP kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea001b196c80 Jul 4 15:49:50 DP kernel: RBP: ffff8807dda99cd8 R08: 0000000000000000 R09: 0000000000000000 Jul 4 15:49:50 DP kernel: R10: ffff8807ffbb5f00 R11: 000000000000005a R12: 0000000000000001 Jul 4 15:49:50 DP kernel: R13: 0000000000000000 R14: ffff8807dda974e0 R15: ffff8807be1f1ec8 Jul 4 15:49:50 DP kernel: FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000 Jul 4 15:49:50 DP kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jul 4 15:49:50 DP kernel: CR2: 0000003b826dc7d0 CR3: 0000000001a0b000 CR4: 00000000000007e0 Jul 4 15:49:50 DP kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 4 15:49:50 DP kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jul 4 15:49:50 DP kernel: Stack: Jul 4 15:49:50 DP kernel: ffff8807dda99d18 ffffffff811b11f6 0000000000000000 0000000200000000 Jul 4 15:49:50 DP kernel: ffff8807be1f1e00 ffff8807be1f1e80 000000000000000c 0000000000000000 Jul 4 15:49:50 DP kernel: ffff8807dda99dc8 ffffffff811b21a2 00000001000438ec ffff8807fd692d00 Jul 4 15:49:50 DP kernel: Call Trace: Jul 4 15:49:50 DP kernel: [] aio_free_ring+0x96/0x1c0 Jul 4 15:49:50 DP kernel: [] free_ioctx+0x1f2/0x250 Jul 4 15:49:50 DP kernel: [] ? idle_balance+0xed/0x140 Jul 4 15:49:50 DP kernel: [] put_ioctx+0x1a/0x30 Jul 4 15:49:50 DP kernel: [] kill_ioctx_work+0x2f/0x40 Jul 4 15:49:50 DP kernel: [] process_one_work+0x183/0x490 Jul 4 15:49:50 DP kernel: [] worker_thread+0x120/0x3a0 Jul 4 15:49:50 DP kernel: [] ? manage_workers+0x160/0x160 Jul 4 15:49:50 DP kernel: [] kthread+0xce/0xe0 Jul 4 15:49:50 DP kernel: [] ? kthread_freezable_should_stop+0x70/0x70 Jul 4 15:49:50 DP kernel: [] ret_from_fork+0x7c/0xb0 Jul 4 15:49:50 DP kernel: [] ? 
kthread_freezable_should_stop+0x70/0x70 Jul 4 15:49:50 DP kernel: Code: 07 00 c0 75 1f f0 ff 4f 1c 0f 94 c0 84 c0 75 0b c9 66 90 c3 0f 1f 80 00 00 00 00 e8 53 fe ff ff c9 66 90 c3 e8 7a fe ff ff c9 c3 <0f> 0b 66 0f 1f 44 00 00 eb f8 48 8b 47 30 eb bc 0f 1f 84 00 00 Jul 4 15:49:50 DP kernel: RIP [] put_page+0x48/0x60 Jul 4 15:49:50 DP kernel: RSP Jul 4 15:49:50 DP kernel: ---[ end trace b5e2c17407c840d8 ]--- INFO: rcu_sched detected stalls on CPUs/tasks: { 4} (detected by 9, t=21056 jiffies, g=4158, c=4157, q=1040) sending NMI to all CPUs: NMI backtrace for cpu 4 CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF D 3.10.0-aio-migrate+ #107 Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 RIP: 0010:[] [] _raw_spin_lock_irq+0x22/0x30 RSP: 0018:ffff8807dda99618 EFLAGS: 00000002 RAX: 000000000000497c RBX: ffff8807fd692d00 RCX: ffff8807dda98010 RDX: 000000000000497e RSI: ffffffff815419a9 RDI: ffff8807fd692d00 RBP: ffff8807dda99618 R08: 0000000000000004 R09: 0000000000000100 R10: 00000000000009fe R11: 00000000000009fe R12: 0000000000000004 R13: 0000000000000009 R14: 0000000000000009 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000001a0b000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8807dda996a8 ffffffff815411b6 ffff8807dda99fd8 0000000000012d00 ffff8807dda98010 0000000000012d00 0000000000012d00 0000000000012d00 ffff8807dda99fd8 0000000000012d00 ffff8807dda974e0 ffff8807dda996c8 Call Trace: [] __schedule+0xd6/0x6f0 [] schedule+0x29/0x70 [] do_exit+0x42a/0x480 [] oops_end+0xa9/0xf0 [] no_context+0x11e/0x1f0 [] __bad_area_nosemaphore+0x11d/0x220 [] bad_area_nosemaphore+0x13/0x20 [] 
__do_page_fault+0xc5/0x490 [] ? call_rcu_sched+0x17/0x20 [] ? strlcpy+0x4a/0x60 [] do_page_fault+0xe/0x10 [] page_fault+0x22/0x30 [] ? kthread_data+0x10/0x20 [] wq_worker_sleeping+0x15/0xa0 [] __schedule+0x5ab/0x6f0 [] ? put_io_context_active+0xc2/0xf0 [] schedule+0x29/0x70 [] do_exit+0x2d5/0x480 [] oops_end+0xa9/0xf0 [] die+0x5b/0x90 [] do_trap+0xcb/0x170 [] ? __atomic_notifier_call_chain+0x12/0x20 [] do_invalid_op+0x95/0xb0 [] ? put_page+0x48/0x60 [] ? truncate_inode_pages_range+0x201/0x500 [] invalid_op+0x18/0x20 [] ? put_page+0x48/0x60 [] ? truncate_setsize+0x19/0x20 [] aio_free_ring+0x96/0x1c0 [] free_ioctx+0x1f2/0x250 [] ? idle_balance+0xed/0x140 [] put_ioctx+0x1a/0x30 [] kill_ioctx_work+0x2f/0x40 [] process_one_work+0x183/0x490 [] worker_thread+0x120/0x3a0 [] ? manage_workers+0x160/0x160 [] kthread+0xce/0xe0 [] ? kthread_freezable_should_stop+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? kthread_freezable_should_stop+0x70/0x70 Code: 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 fa b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0d 0f 1f 00 f3 90 <0f> b7 07 66 39 c2 75 f6 c9 c3 0f 1f 40 00 55 48 89 e5 66 66 66 NMI backtrace for cpu 1 > > -ben -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Thu, 4 Jul 2013 07:41:53 -0400 From: Benjamin LaHaise Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Message-ID: <20130704114153.GD11006@kvack.org> References: <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org> <51D51B66.3000301@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51D51B66.3000301@cn.fujitsu.com> Sender: owner-linux-mm@kvack.org List-ID: To: Gu Zheng Cc: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski On Thu, Jul 04, 2013 at 02:51:18PM +0800, Gu Zheng wrote: > Hi Ben, > When I test your patch on kernel 3.10, the kernel panic when aio job > complete or exit, exactly in aio_free_ring(), the following is a part of dmesg. What is your test case? -ben -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <51D63B9C.4060204@cn.fujitsu.com> Date: Fri, 05 Jul 2013 11:21:00 +0800 From: Gu Zheng MIME-Version: 1.0 Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org> <51D51B66.3000301@cn.fujitsu.com> <20130704114153.GD11006@kvack.org> In-Reply-To: <20130704114153.GD11006@kvack.org> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Benjamin LaHaise Cc: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski On 07/04/2013 07:41 PM, Benjamin LaHaise wrote: > On Thu, Jul 04, 2013 at 02:51:18PM +0800, Gu Zheng wrote: >> Hi Ben, >> When I test your patch on kernel 3.10, the kernel panic when aio job >> complete or exit, exactly in aio_free_ring(), the following is a part of dmesg. > > What is your test case? Just the one you mentioned in the previous mail: http://www.kvack.org/~bcrl/aio/aio-numa-test.c Thanks, Gu > > -ben > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754037Ab3BEJXe (ORCPT ); Tue, 5 Feb 2013 04:23:34 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:20465 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1751826Ab3BEJXU (ORCPT ); Tue, 5 Feb 2013 04:23:20 -0500 X-IronPort-AV: E=Sophos;i="4.84,603,1355068800"; d="scan'208";a="6691733" From: Lin Feng To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng Subject: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Date: Tue, 5 Feb 2013 17:21:52 +0800 Message-Id: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 17:21:59, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 17:22:00, Serialize complete at 2013/02/05 17:22:00 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org get_user_pages() always tries to allocate pages from movable zone, which is not reliable to memory hotremove framework in some case. This patch introduces a new library function called get_user_pages_non_movable() to pin pages only from zone non-movable in memory. 
It's a wrapper of get_user_pages() but it makes sure that all pages come from
non-movable zone via additional page migration.

Cc: Andrew Morton
Cc: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Cc: Yasuaki Ishimatsu
Cc: Jeff Moyer
Cc: Minchan Kim
Cc: Zach Brown
Reviewed-by: Tang Chen
Reviewed-by: Gu Zheng
Signed-off-by: Lin Feng
---
 include/linux/mm.h     |    3 ++
 include/linux/mmzone.h |    4 ++
 mm/memory.c            |   83 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_isolation.c    |    5 +++
 4 files changed, 95 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 12f5a09..3ff9eba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1049,6 +1049,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas);
 struct kvec;
 int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
 			struct page **pages);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e25ab6f..1506351 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -841,6 +841,10 @@ static inline int is_normal_idx(enum zone_type idx)
 	return (idx == ZONE_NORMAL);
 }
 
+static inline int zone_is_movable(struct zone *zone)
+{
+	return zone_idx(zone) == ZONE_MOVABLE;
+}
 /**
  * is_highmem - helper function to quickly check if a struct zone is a
  *              highmem zone or not.  This is an attempt to keep references
diff --git a/mm/memory.c b/mm/memory.c
index bb1369f..ede53cc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -58,6 +58,8 @@
 #include
 #include
 #include
+#include
+#include
 
 #include
 #include
@@ -1995,6 +1997,87 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/**
+ * It's a wrapper of get_user_pages() but it makes sure that all pages come from
+ * non-movable zone via additional page migration. It's designed for memory
+ * hotremove framework.
+ *
+ * Currently get_user_pages() always tries to allocate pages from movable zone,
+ * in some case users of get_user_pages() is easy to pin user pages for a long
+ * time(for now we found that pages pinned as aio ring pages is such case),
+ * which is fatal for memory hotremove framework.
+ *
+ * This function first calls get_user_pages() to get the candidate pages, and
+ * then check to ensure all pages are from non movable zone. Otherwise migrate
+ * them to non movable zone, then retry. It will at most retry once.
+ */
+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	int ret, i, isolate_err, migrate_pre_flag;
+	LIST_HEAD(pagelist);
+
+retry:
+	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
+				vmas);
+	if (ret <= 0)
+		return ret;
+
+	isolate_err = 0;
+	migrate_pre_flag = 0;
+
+	for (i = 0; i < ret; i++) {
+		if (zone_is_movable(page_zone(pages[i]))) {
+			if (!migrate_pre_flag) {
+				if (migrate_prep())
+					goto release_page;
+				migrate_pre_flag = 1;
+			}
+
+			if (!isolate_lru_page(pages[i])) {
+				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
+						 page_is_file_cache(pages[i]));
+				list_add_tail(&pages[i]->lru, &pagelist);
+			} else {
+				isolate_err = 1;
+				goto release_page;
+			}
+		}
+	}
+
+	/* All pages are non movable, we are done :) */
+	if (i == ret && list_empty(&pagelist))
+		return ret;
+
+release_page:
+	/* Undo the effects of former get_user_pages(), we won't pin anything */
+	release_pages(pages, ret, 1);
+
+	if (migrate_pre_flag && !isolate_err) {
+		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
+					false, MIGRATE_SYNC, MR_SYSCALL);
+		/* Steal pages from non-movable zone successfully? */
+		if (!ret)
+			goto retry;
+	}
+
+	putback_lru_pages(&pagelist);
+	/* Migration failed, we pin 0 page, tell caller the truth */
+	return 0;
+}
+#else
+inline int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	return get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
+				vmas);
+}
+#endif
+EXPORT_SYMBOL(get_user_pages_non_movable);
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 383bdbb..1b7bd17 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -247,6 +247,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
 	return ret ? 0 : -EBUSY;
 }
 
+/**
+ * @private: 0 means page can be alloced from movable zone, otherwise forbidden
+ */
 struct page *alloc_migrate_target(struct page *page, unsigned long private,
 				  int **resultp)
 {
@@ -254,6 +257,8 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
 
 	if (PageHighMem(page))
 		gfp_mask |= __GFP_HIGHMEM;
+	if (unlikely(private != 0))
+		gfp_mask &= ~__GFP_MOVABLE;
 
 	return alloc_page(gfp_mask);
 }
-- 
1.7.1

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754055Ab3BEJXh (ORCPT ); Tue, 5 Feb 2013 04:23:37 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:40638 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1751257Ab3BEJX2 (ORCPT ); Tue, 5 Feb 2013 04:23:28 -0500 X-IronPort-AV: E=Sophos;i="4.84,603,1355068800"; d="scan'208";a="6691734" From: Lin Feng To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com,
isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng Subject: [PATCH V2 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Date: Tue, 5 Feb 2013 17:21:53 +0800 Message-Id: <1360056113-14294-3-git-send-email-linfeng@cn.fujitsu.com> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 17:22:00, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 17:22:01, Serialize complete at 2013/02/05 17:22:01 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch gets around the aio ring pages can't be migrated bug caused by get_user_pages() via using the new function. It only works as configed with CONFIG_MEMORY_HOTREMOVE, otherwise it falls back to use the old version of get_user_pages(). 
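To illustrate the fallback described above, here is a small user-space sketch of the compile-time pattern, not the kernel code. All names are hypothetical stand-ins invented for illustration: HOTREMOVE_SKETCH plays the role of CONFIG_MEMORY_HOTREMOVE, pin_pages_plain() of get_user_pages(), and pin_pages_non_movable() of get_user_pages_non_movable(). The zone check and migration are elided and only marked by a different tag value.

```c
/* Hypothetical stand-in for get_user_pages(): "pins" nr_pages entries by
 * marking them with 1 and reports how many were pinned. */
static int pin_pages_plain(int *pinned, int nr_pages)
{
	for (int i = 0; i < nr_pages; i++)
		pinned[i] = 1;
	return nr_pages;
}

#ifdef HOTREMOVE_SKETCH	/* assumed stand-in for CONFIG_MEMORY_HOTREMOVE */
/* With hot-remove support the wrapper would also check zones and migrate.
 * That work is elided here; the tag 2 only marks that this path ran. */
int pin_pages_non_movable(int *pinned, int nr_pages)
{
	int ret = pin_pages_plain(pinned, nr_pages);

	for (int i = 0; i < ret; i++)
		pinned[i] = 2;
	return ret;
}
#else
/* Without hot-remove support the wrapper is a plain pass-through. */
int pin_pages_non_movable(int *pinned, int nr_pages)
{
	return pin_pages_plain(pinned, nr_pages);
}
#endif
```

Because the choice is made inside the wrapper, a caller such as the aio ring setup can call one function unconditionally and needs no #ifdef of its own.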
Cc: Benjamin LaHaise
Cc: Alexander Viro
Cc: Andrew Morton
Cc: Jeff Moyer
Cc: Minchan Kim
Cc: Zach Brown
Reviewed-by: Tang Chen
Reviewed-by: Gu Zheng
Signed-off-by: Lin Feng
---
 fs/aio.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 71f613c..f7a0d5c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -138,8 +138,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 	}
 
 	dprintk("mmap address: 0x%08lx\n", info->mmap_base);
-	info->nr_pages = get_user_pages(current, ctx->mm,
-					info->mmap_base, nr_pages,
+	info->nr_pages = get_user_pages_non_movable(current, ctx->mm,
+					info->mmap_base, nr_pages,
 					1, 0, info->ring_pages, NULL);
 	up_write(&ctx->mm->mmap_sem);
 
-- 
1.7.1

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753689Ab3BEJX2 (ORCPT ); Tue, 5 Feb 2013 04:23:28 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:40638 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1751806Ab3BEJXR (ORCPT ); Tue, 5 Feb 2013 04:23:17 -0500 X-IronPort-AV: E=Sophos;i="4.84,603,1355068800"; d="scan'208";a="6691732" From: Lin Feng To: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk Cc: khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Lin Feng Subject: [PATCH V2 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Date: Tue, 5 Feb 2013 17:21:51 +0800 Message-Id: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> X-Mailer: git-send-email 1.7.11.7 X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05
17:21:58, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/05 17:21:59, Serialize complete at 2013/02/05 17:21:59 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently get_user_pages() always tries to allocate pages from movable zone, as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users of get_user_pages() is easy to pin user pages for a long time(for now we found that pages pinned as aio ring pages is such case), which is fatal for memory hotplug/remove framework. So the 1st patch introduces a new library function called get_user_pages_non_movable() to pin pages only from zone non-movable in memory. It's a wrapper of get_user_pages() but it makes sure that all pages come from non-movable zone via additional page migration. The 2nd patch gets around the aio ring pages can't be migrated bug caused by get_user_pages() via using the new function. It only works when configed with CONFIG_MEMORY_HOTREMOVE, otherwise it falls back to use the old version of get_user_pages(). --- ChangeLog v1->v2: Patch1: - Fix the negative return value bug pointed out by Andrew and other suggestions pointed out by Andrew and Jeff. Patch2: - Kill the CONFIG_MEMORY_HOTREMOVE dependence suggested by Jeff. 
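A minimal user-space sketch of the control flow described above, not the kernel code itself: pin the pages, check each page's zone, and if any page sits in the movable zone, drop every pin, migrate, and retry. struct fake_page, pin_pages(), unpin_pages(), migrate_out_of_movable() and pin_non_movable() are hypothetical names invented for illustration. The explicit bound of one retry is an assumption taken from the "at most retry once" wording in the patch comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the fields the sketch needs. */
struct fake_page {
	bool movable_zone;	/* true if the page lives in the movable zone */
	bool migratable;	/* false models a page that cannot be migrated */
	int pin_count;		/* models the reference taken by get_user_pages() */
};

/* Models get_user_pages(): take one reference on every page. */
static int pin_pages(struct fake_page *pages, int nr_pages)
{
	for (int i = 0; i < nr_pages; i++)
		pages[i].pin_count++;
	return nr_pages;
}

/* Models release_pages(): drop the references taken by pin_pages(). */
static void unpin_pages(struct fake_page *pages, int nr_pages)
{
	for (int i = 0; i < nr_pages; i++)
		pages[i].pin_count--;
}

/*
 * Models migrate_pages() with a non-movable target: returns 0 on success,
 * or the number of pages that could not be moved.
 */
static int migrate_out_of_movable(struct fake_page *pages, int nr_pages)
{
	int failed = 0;

	for (int i = 0; i < nr_pages; i++) {
		if (!pages[i].movable_zone)
			continue;
		if (pages[i].migratable)
			pages[i].movable_zone = false;
		else
			failed++;
	}
	return failed;
}

/*
 * Pin nr_pages pages, all outside the movable zone. Returns the number of
 * pages pinned, or 0 (with nothing left pinned) if migration failed.
 */
int pin_non_movable(struct fake_page *pages, int nr_pages)
{
	assert(pages != NULL && nr_pages >= 0);

	/* Assumption: one initial attempt plus a single retry. */
	for (int attempt = 0; attempt < 2; attempt++) {
		int ret = pin_pages(pages, nr_pages);
		bool found_movable = false;

		for (int i = 0; i < ret; i++)
			if (pages[i].movable_zone)
				found_movable = true;

		if (!found_movable)
			return ret;	/* every pinned page is non-movable */

		/* Undo the pins before migrating, as the patch does. */
		unpin_pages(pages, ret);
		if (migrate_out_of_movable(pages, nr_pages) != 0)
			break;		/* migration failed: pin nothing */
	}
	return 0;
}
```

In this model a failed migration leaves no page pinned and reports 0, which matches the "we pin 0 page, tell caller the truth" behaviour stated in the patch.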
--- Lin Feng (2): mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove fs/aio.c | 4 +- include/linux/mm.h | 3 ++ include/linux/mmzone.h | 4 ++ mm/memory.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++ mm/page_isolation.c | 5 +++ 5 files changed, 97 insertions(+), 2 deletions(-) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754473Ab3BEMBv (ORCPT ); Tue, 5 Feb 2013 07:01:51 -0500 Received: from cantor2.suse.de ([195.135.220.15]:46400 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751211Ab3BEMBm (ORCPT ); Tue, 5 Feb 2013 07:01:42 -0500 Date: Tue, 5 Feb 2013 12:01:37 +0000 From: Mel Gorman To: Lin Feng Cc: akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130205120137.GG21389@suse.de> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote: > get_user_pages() always tries to 
allocate pages from movable zone, which is not > reliable to memory hotremove framework in some case. > > This patch introduces a new library function called get_user_pages_non_movable() > to pin pages only from zone non-movable in memory. > It's a wrapper of get_user_pages() but it makes sure that all pages come from > non-movable zone via additional page migration. > > Cc: Andrew Morton > Cc: Mel Gorman > Cc: KAMEZAWA Hiroyuki > Cc: Yasuaki Ishimatsu > Cc: Jeff Moyer > Cc: Minchan Kim > Cc: Zach Brown > Reviewed-by: Tang Chen > Reviewed-by: Gu Zheng > Signed-off-by: Lin Feng I already had started the review of V1 before this was sent unfortunately. However, I think the feedback I gave for V1 is still valid so I'll wait for comments on that review before digging further. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756425Ab3BFAmj (ORCPT ); Tue, 5 Feb 2013 19:42:39 -0500 Received: from LGEMRELSE6Q.lge.com ([156.147.1.121]:47945 "EHLO LGEMRELSE6Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754976Ab3BFAmi (ORCPT ); Tue, 5 Feb 2013 19:42:38 -0500 X-AuditID: 9c930179-b7c24ae00000119c-55-5111a6fb5677 Date: Wed, 6 Feb 2013 09:42:34 +0900 From: Minchan Kim To: Mel Gorman Cc: Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130206004234.GD11197@blaptop> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> 
<1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130205120137.GG21389@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Feb 05, 2013 at 12:01:37PM +0000, Mel Gorman wrote:
> On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote:
> > get_user_pages() always tries to allocate pages from movable zone, which is not
> > reliable to memory hotremove framework in some case.
> >
> > This patch introduces a new library function called get_user_pages_non_movable()
> > to pin pages only from zone non-movable in memory.
> > It's a wrapper of get_user_pages() but it makes sure that all pages come from
> > non-movable zone via additional page migration.
> >
> > Cc: Andrew Morton
> > Cc: Mel Gorman
> > Cc: KAMEZAWA Hiroyuki
> > Cc: Yasuaki Ishimatsu
> > Cc: Jeff Moyer
> > Cc: Minchan Kim
> > Cc: Zach Brown
> > Reviewed-by: Tang Chen
> > Reviewed-by: Gu Zheng
> > Signed-off-by: Lin Feng
>
> I already had started the review of V1 before this was sent
> unfortunately. However, I think the feedback I gave for V1 is still
> valid so I'll wait for comments on that review before digging further.

Mel, Andrew

Sorry for making noise if you have already confirmed the direction, but I have a concern about it.

IMHO, we can't expect that most users with MEMORY_HOTPLUG will release pinned pages immediately. In addition, MEMORY_HOTPLUG could be used on embedded systems to reduce power via PASR, and some embedded drivers could use GUP anytime and anywhere. They can't know in advance whether they will hold pinned pages for a long time or release them quickly, because that depends on events such as a user's response, which can't be predicted.
So to solve it, we could add a WARN_ON in the CMA/MEMORY_HOTPLUG code for the case where migration fails because of the page count, then investigate whether they are really using GUP and whether it's REALLY the culprit. If so, yell at them "Please use GUP_NM instead"?

Yes, it could be done, but it would be a rather troublesome job. It might not even be triggered during the QE phase, so the trouble doesn't end until everybody uses GUP_NM.

Let's consider another case. Some driver pins a page for a very short time, so its author decides to use GUP instead of GUP_NM. But someday some user starts to use the driver very often, so although each pin is very short, the effect could be a permanent pin if the user calls it often enough. In the end, we would have to change it to GUP_NM again.

IMHO, in the future we will end up changing most GUP users to GUP_NM if CMA and MEMORY_HOTPLUG become available everywhere.

So, what's wrong if we replace get_user_pages with get_user_pages_non_movable in MEMORY_HOTPLUG/CMA without exposing get_user_pages_non_movable?

I mean this:

#ifdef CONFIG_MIGRATE_ISOLATE
int get_user_pages()
{
	return __get_user_pages_non_movable();
}
#else
int get_user_pages()
{
	return old_get_user_pages();
}
#endif

IMHO, get_user_pages isn't a performance-sensitive function. If a user were sensitive about it, he should have tried get_user_pages_fast.

THP degradation by increasing MIGRATE_UNMOVABLE?
Lin said most of GUP pages release the page in short so is it really problem?
Even in embedded, we don't use THP yet but CMA and GUP call would be not too often
but failing of CMA would be critical.

I'd like to hear opinions.

>
> --
> Mel Gorman
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756892Ab3BFBX7 (ORCPT ); Tue, 5 Feb 2013 20:23:59 -0500 Received: from kanga.kvack.org ([205.233.56.17]:39613 "EHLO kanga.kvack.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755372Ab3BFBX4 (ORCPT ); Tue, 5 Feb 2013 20:23:56 -0500 X-Greylist: delayed 1899 seconds by postgrey-1.27 at vger.kernel.org; Tue, 05 Feb 2013 20:23:56 EST Date: Tue, 5 Feb 2013 19:52:17 -0500 From: Benjamin LaHaise To: Minchan Kim Cc: Mel Gorman , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130206005217.GJ20842@kvack.org> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130206004234.GD11197@blaptop> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Feb 06, 2013 at 09:42:34AM +0900, Minchan Kim wrote: > THP degradation by increasing MIGRATE_UNMOVABLE? > Lin said most of GUP pages release the page in short so is it really problem? > Even in embedded, we don't use THP yet but CMA and GUP call would be not too often > but failing of CMA would be critical. 
> > I'd like to hear opinions. If aio was given a callback to migrate the pages on, it could just migrate the pages as needed. There's nothing fundamental preventing that approach. -ben -- "Thought is the essence of where you are now." From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756984Ab3BFJ4b (ORCPT ); Wed, 6 Feb 2013 04:56:31 -0500 Received: from cantor2.suse.de ([195.135.220.15]:59049 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756369Ab3BFJ4Z (ORCPT ); Wed, 6 Feb 2013 04:56:25 -0500 Date: Wed, 6 Feb 2013 09:56:17 +0000 From: Mel Gorman To: Minchan Kim Cc: Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130206095617.GN21389@suse.de> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20130206004234.GD11197@blaptop> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Feb 06, 2013 at 09:42:34AM +0900, Minchan Kim wrote: > On Tue, Feb 05, 2013 at 12:01:37PM +0000, Mel Gorman wrote: > > On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote: > > > get_user_pages() always tries to allocate pages from 
movable zone, which is not > > > reliable to memory hotremove framework in some case. > > > > > > This patch introduces a new library function called get_user_pages_non_movable() > > > to pin pages only from zone non-movable in memory. > > > It's a wrapper of get_user_pages() but it makes sure that all pages come from > > > non-movable zone via additional page migration. > > > > > > Cc: Andrew Morton > > > Cc: Mel Gorman > > > Cc: KAMEZAWA Hiroyuki > > > Cc: Yasuaki Ishimatsu > > > Cc: Jeff Moyer > > > Cc: Minchan Kim > > > Cc: Zach Brown > > > Reviewed-by: Tang Chen > > > Reviewed-by: Gu Zheng > > > Signed-off-by: Lin Feng > > > > I already had started the review of V1 before this was sent > > unfortunately. However, I think the feedback I gave for V1 is still > > valid so I'll wait for comments on that review before digging further. > > Mel, Andrew > > Sorry for making noise if you already confirmed the direction but I have a concern > about that. I haven't confirmed any sort of direction, nor do I determine the direction for memory hot-remove which I'm only paying vague attention to. I stated a while ago that I think the use of ZONE_MOVABLE is a bad idea for "guaranteeing" memory hot-remove and is already going the "wrong" direction. That's just my opinion. This patch is about mitigating (but not solving) the problem of long-lived pins. In the general case, about all I could think of for that is that the kernel would have to warn the administrator what applications had pinned the memory and wait for the user to shut them down. To guarantee anything, it would be necessary for subsystems to implement a callback for migration to unpin pages, barrier operations until migration completes and pin the new pfns. > Because IMHO, we can't expect most of user for MEMORY_HOTPLUG will release > pinned pages immediately. Indeed not, but it's not really what this patch is about. This patch is about moving the pages before they get permanently pinned. 
It mitigates the problem but does not solve it because there is no guarantee that the driver pinning a page will flag it properly. > In addition, MEMORY_HOTPLUG could be used for embedded system > for reducing power by PASR and some drivers in embedded could use GUP anytime and anywhere. > They can't know in advance they will use pinned pages long time or release in short time > because it depends on some event like user's response which is very not predetermined. True. This patch does not solve that problem. > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of > failing migration by page count and then, investigate they are really using GUP and > it's REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? Within the context of this patch, that is their main option. Finding who is holding the pin is a problem. For userspace-pinned buffers it's straightforward as rmap will identify what processes are holding the pin (page->list vmas->mm, lookup all tasks until p->mm == mm) and report that. For driver-related pins, it's not as straightforward. I guess there could be a callback to give meaningful information on it but no guarantee that drivers pinning pages will implement it. In that case all you could do was dump page->mapping and punt it at a kernel developer to figure out the responsible driver. This might be manageable for memory hot-remove where there is an administrator but may not work at all for embedded users. There is the possibility that callbacks could be introduced for migrate_unpin() and migrate_pin() that take a list of PFN pairs (old,new). The unpin callback should release the old PFNs and barrier against any operations until the migrate_pin() callback is called with the updated pfns to be repinned. Again it would fully depend on subsystems implementing it properly. The callback interface would be more robust but puts a lot more work on the driver side where your mileage will vary. > Yes.
it could be done but it would be rather troublesome job. Even it couldn't be triggered > during QE phase so that trouble doesn't end until all guys uses GUP_NM. > Let's consider another case. Some driver pin the page in very short time > so he decide to use GUP instead of GUP_NM but someday, some user start to use the driver > very often so although pinning time is very short, it could be forever pinning effect > if the user calls it very often. In the end, we should change it with GUP_NM, again. > IMHO, In future, we ends up changing most of GUP user with GUP_NM if CMA and MEMORY_HOTPLUG > is available all over the world. > Same thing, callbacks to unpin and barrier would handle such a case by effectively freezing the driver or subsystem responsible for the page. > So, what's wrong if we replace get_user_pages with get_user_pages_non_movable > in MEMORY_HOTPLUG/CMA without exposing get_user_pages_non_movable? > > I mean this > > #ifdef CONFIG_MIGRATE_ISOLATE > int get_user_pages() > { > return __get_user_pages_non_movable(); > } > #else > int get_user_pages() > { > return old_get_user_pages(); > } > #endif > That will migrate everything out of ZONE_MOVABLE every time it's pinned. One consequence is that direct IO can never use ZONE_MOVABLE on these systems. It'll create a variation of the lowmem exhaustion problem. > IMHO, get_user_pages isn't performance sensitive function. If user was sensitive > about it, he should have tried get_user_pages_fast. That opens a different can of worms. get_user_pages is part of the gup_fast slowpath. > THP degradation by increasing MIGRATE_UNMOVABLE? The patch should not be converting MIGRATE_MOVABLE requests to MIGRATE_UNMOVABLE. I covered this in the review of v1. > Lin said most of GUP pages release the page in short so is it really problem? > Even in embedded, we don't use THP yet but CMA and GUP call would be not too often > but failing of CMA would be critical.
> To guarantee CMA can migrate pages pinned by drivers I think you need migrate-related callsbacks to unpin, barrier the driver until migration completes and repin. I do not know, or at least have no heard, of anyone working on such a scheme. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759440Ab3BHCcn (ORCPT ); Thu, 7 Feb 2013 21:32:43 -0500 Received: from LGEMRELSE6Q.lge.com ([156.147.1.121]:43320 "EHLO LGEMRELSE6Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758237Ab3BHCcl (ORCPT ); Thu, 7 Feb 2013 21:32:41 -0500 X-AuditID: 9c930179-b7c24ae00000119c-0d-511463c6da71 Date: Fri, 8 Feb 2013 11:32:37 +0900 From: Minchan Kim To: Mel Gorman Cc: Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130208023237.GK11197@blaptop> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130206095617.GN21389@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Mel, On Wed, Feb 06, 2013 at 09:56:17AM +0000, Mel Gorman wrote: > On Wed, Feb 06, 2013 at 
09:42:34AM +0900, Minchan Kim wrote: > > On Tue, Feb 05, 2013 at 12:01:37PM +0000, Mel Gorman wrote: > > > On Tue, Feb 05, 2013 at 05:21:52PM +0800, Lin Feng wrote: > > > > get_user_pages() always tries to allocate pages from movable zone, which is not > > > > reliable to memory hotremove framework in some case. > > > > > > > > This patch introduces a new library function called get_user_pages_non_movable() > > > > to pin pages only from zone non-movable in memory. > > > > It's a wrapper of get_user_pages() but it makes sure that all pages come from > > > > non-movable zone via additional page migration. > > > > > > > > Cc: Andrew Morton > > > > Cc: Mel Gorman > > > > Cc: KAMEZAWA Hiroyuki > > > > Cc: Yasuaki Ishimatsu > > > > Cc: Jeff Moyer > > > > Cc: Minchan Kim > > > > Cc: Zach Brown > > > > Reviewed-by: Tang Chen > > > > Reviewed-by: Gu Zheng > > > > Signed-off-by: Lin Feng > > > > > > I already had started the review of V1 before this was sent > > > unfortunately. However, I think the feedback I gave for V1 is still > > > valid so I'll wait for comments on that review before digging further. > > > > Mel, Andrew > > > > Sorry for making noise if you already confirmed the direction but I have a concern > > about that. > > I haven't confirmed any sort of direction, nor do I determine the > direction for memory hot-remove which I'm only paying vague attention to. > I stated a while ago that I think the use of ZONE_MOVABLE is a bad idea > for "guaranteeing" memory hot-remove and is already going the "wrong" > direction. That's just my opinion. > > This patch is about mitigating (but not solving) the problem of long-lived > pins. In the general case, about all I could think of for that is that the Agreed. > kernel would have to warn the administrator what applications had pinned > the memory and wait for the user to shut them down. 
To guarantee anything, > it would be necessary for subsystems to implement a callback for migration > to unpin pages, barrier operations until migration completes and pin the > new pfns. That could work for a SUBSYSTEM, but it's very hard for all DRIVER developers, and I doubt we can give them a common template that most driver developers can reuse. > > > Because IMHO, we can't expect most of user for MEMORY_HOTPLUG will release > > pinned pages immediately. > > Indeed not, but it's not really what this patch is about. This patch is > about moving the pages before they get permanently pinned. It mitigates > the problem but does not solve it because there is no guarantee that the > driver pinning a page will flag it properly. True. And I doubt that what memory-hotplug guys really want is best effort rather than a guarantee. Anyway, CMA wants a guarantee, and even low latency, and I hope this patch solves the problem for both memory-hotplug and CMA. > > > In addition, MEMORY_HOTPLUG could be used for embedded system > > for reducing power by PASR and some drivers in embedded could use GUP anytime and anywhere. > > They can't know in advance they will use pinned pages long time or release in short time > > because it depends on some event like user's response which is very not predetermined. > > True. This patch does not solve that problem. > > > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of > > failing migration by page count and then, investigate they are really using GUP and > > it's REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? > > Within the context of this patch, that is their main option. Finding > who is holding the pin is a problem. For userspace-pinned buffers it's > straightforward as rmap will identify what processes are holding the > pin (page->list vmas->mm, lookup all tasks until p->mm == mm) and report > that. For driver-related pins, it's not as straightforward. I guess there True.
> could be a callback to give meaningful information on it but no guarantee > that drivers pinning pages will implement it. In that case all you could do Nod. > was dump page->mapping and punt it at a kernel developer to figure out the > responsible driver. This might be manageable for memory hot-remove where > there is an administrator but may not work at all for embedded users. Yeah. There are even proprietary modules in embedded whose source code we can't see. > > There is the possibility that callbacks could be introduced for > migrate_unpin() and migrate_pin() that take a list of PFN pairs > (old,new). The unpin callback should release the old PFNs and barrier > against any operations until the migrate_pin() callback is called with > the updated pfns to be repinned. Again it would fully depend on subsystems > implementing it properly. > > The callback interface would be more robust but puts a lot more work on > the driver side where your mileage will vary. True. > > > Yes. it could be done but it would be rather troublesome job. Even it couldn't be triggered > > during QE phase so that trouble doesn't end until all guys uses GUP_NM. > > Let's consider another case. Some driver pin the page in very short time > > so he decide to use GUP instead of GUP_NM but someday, some user start to use the driver > > very often so although pinning time is very short, it could be forever pinning effect > > if the user calls it very often. In the end, we should change it with GUP_NM, again. > > IMHO, In future, we ends up changing most of GUP user with GUP_NM if CMA and MEMORY_HOTPLUG > > is available all over the world. > > > > Same thing, callbacks to unpin and barrier would handle such a case by > effectively freezing the driver or subsystem responsible for the page. > > > So, what's wrong if we replace get_user_pages with get_user_pages_non_movable > > in MEMORY_HOTPLUG/CMA without exposing get_user_pages_non_movable?
> > > > I mean this > > > > #ifdef CONFIG_MIGRATE_ISOLATE > > int get_user_pages() > > { > > return __get_user_pages_non_movable(); > > } > > #else > > int get_user_pages() > > { > > return old_get_user_pages(); > > } > > #endif > > > > That will migrate everything out of ZONE_MOVABLE every time it's pinned. > One consequence is that direct IO can never use ZONE_MOVABLE on these > systems. It'll create a variation of the lowmem exhaustion problem. For example, suppose there is a 4G highmem zone and half of it is the movable zone. In that case, we can use the extra 2G of highmem space instead of lowmem. But I agree it could end up pinning many lowmem pages, so the problem would happen. IMHO, shouldn't that be the trade-off for using MEMORY-HOTPLUG/CMA? > > > IMHO, get_user_pages isn't performance sensitive function. If user was sensitive > > about it, he should have tried get_user_pages_fast. > > That opens a different can of worms. get_user_pages is part of the > gup_fast slowpath. > > > THP degradation by increasing MIGRATE_UNMOVABLE? > > The patch should not be converting MIGRATE_MOVABLE requests to > MIGRATE_UNMOVABLE. I covered this in the review of v1. I guess memory-hotplug guys want to use GUP_NM for long-time pin users. So doesn't it make sense to migrate the page into MIGRATE_UNMOVABLE? But I'm not sure about GUP_NM's semantics. > > > Lin said most of GUP pages release the page in short so is it really problem? > > Even in embedded, we don't use THP yet but CMA and GUP call would be not too often > > but failing of CMA would be critical. > > > > To guarantee CMA can migrate pages pinned by drivers I think you need > migrate-related callbacks to unpin, barrier the driver until migration > completes and repin. I agree it's an ideal solution when we consider the future, but as you already mentioned, it's not easy for all drivers. In fact, I don't want to insist on my opinion for CMA because I guess the CMA design was not good from the beginning.
I just posted my concern and want to discuss to solve the problem but if there are not plain solution now, let me pass the decision to maintainer. Thanks for sharing your opinion, Mel! > > I do not know, or at least have no heard, of anyone working on such a > scheme. > > -- > Mel Gorman > SUSE Labs > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935285Ab3BTMkl (ORCPT ); Wed, 20 Feb 2013 07:40:41 -0500 Received: from cn.fujitsu.com ([222.73.24.84]:60747 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S932945Ab3BTMkk (ORCPT ); Wed, 20 Feb 2013 07:40:40 -0500 X-IronPort-AV: E=Sophos;i="4.84,702,1355068800"; d="scan'208";a="6736913" Message-ID: <5124C3E6.1060108@cn.fujitsu.com> Date: Wed, 20 Feb 2013 20:39:02 +0800 From: Lin Feng User-Agent: Mozilla/5.0 (X11; Linux i686; rv:15.0) Gecko/20120911 Thunderbird/15.0.1 MIME-Version: 1.0 To: Wanpeng Li CC: akpm@linux-foundation.org, mgorman@suse.de, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, minchan@kernel.org, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130220113757.GA10124@hacker.(null)> In-Reply-To: 
<20130220113757.GA10124@hacker.(null)> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/20 20:39:56, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/02/20 20:39:57, Serialize complete at 2013/02/20 20:39:57 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Wanpeng, On 02/20/2013 07:37 PM, Wanpeng Li wrote: >> + * This function first calls get_user_pages() to get the candidate pages, and >> >+ * then check to ensure all pages are from non movable zone. Otherwise migrate > How about "Otherwise migrate candidate pages which have already been > isolated to non movable zone."? > Which is just what the code does, I'm feeling that it's too detailed to be proper :( Do we have to comment it like that detailedly? thanks, linfeng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752357Ab3EMJIx (ORCPT ); Mon, 13 May 2013 05:08:53 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:35116 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1751564Ab3EMJIv (ORCPT ); Mon, 13 May 2013 05:08:51 -0400 X-IronPort-AV: E=Sophos;i="4.87,661,1363104000"; d="scan'208";a="7255668" Message-ID: <5190AE4F.4000103@cn.fujitsu.com> Date: Mon, 13 May 2013 17:11:43 +0800 From: Tang Chen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Mel Gorman CC: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, 
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> In-Reply-To: <20130206095617.GN21389@suse.de> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/13 17:07:43, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/13 17:07:47, Serialize complete at 2013/05/13 17:07:47 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Mel, On 02/06/2013 05:56 PM, Mel Gorman wrote: > > There is the possibility that callbacks could be introduced for > migrate_unpin() and migrate_pin() that take a list of PFN pairs > (old,new). The unpin callback should release the old PFNs and barrier > against any operations until the migrate_pin() callback is called with > the updated pfns to be repinned. Again it would fully depend on subsystems > implementing it properly. > > The callback interface would be more robust but puts a lot more work on > the driver side where your mileage will vary. > I'm very interested in the "callback" approach you described. In the memory hot-remove case, the aio pages are pinned in memory, so the pages cannot be offlined and, furthermore, cannot be removed. IIUC, you mean implementing migrate_unpin() and migrate_pin() callbacks in the aio subsystem, and calling them when the hot-remove code tries to offline pages, right? If so, I'm wondering where we should put these callback pointers. In struct page? It has been a long time since this topic was discussed.
But to solve this problem cleanly for hotplug guys and CMA guys, please give some more comments. Thanks. :) > > To guarantee CMA can migrate pages pinned by drivers I think you need > migrate-related callsbacks to unpin, barrier the driver until migration > completes and repin. > > I do not know, or at least have no heard, of anyone working on such a > scheme. > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751448Ab3EMJTK (ORCPT ); Mon, 13 May 2013 05:19:10 -0400 Received: from cantor2.suse.de ([195.135.220.15]:56043 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750912Ab3EMJTJ (ORCPT ); Mon, 13 May 2013 05:19:09 -0400 Date: Mon, 13 May 2013 10:19:02 +0100 From: Mel Gorman To: Tang Chen Cc: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130513091902.GP11497@suse.de> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <5190AE4F.4000103@cn.fujitsu.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, May 13, 2013 
at 05:11:43PM +0800, Tang Chen wrote: > Hi Mel, > > On 02/06/2013 05:56 PM, Mel Gorman wrote: > > > >There is the possibility that callbacks could be introduced for > >migrate_unpin() and migrate_pin() that take a list of PFN pairs > >(old,new). The unpin callback should release the old PFNs and barrier > >against any operations until the migrate_pin() callback is called with > >the updated pfns to be repinned. Again it would fully depend on subsystems > >implementing it properly. > > > >The callback interface would be more robust but puts a lot more work on > >the driver side where your mileage will vary. > > > > I'm very interested in the "callback" approach you described. > > In the memory hot-remove case, the aio pages are pinned in memory, so > the pages cannot be offlined and, furthermore, cannot be removed. > > IIUC, you mean implementing migrate_unpin() and migrate_pin() callbacks in the aio > subsystem, and calling them when the hot-remove code tries to offline > pages, right? > > If so, I'm wondering where we should put these callback pointers. > In struct page? > No, I would expect the callbacks to be part of the address space operations, which can be found via page->mapping.
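[Editor's sketch, not part of the original message.] The idea of reaching per-owner migrate_unpin()/migrate_pin() callbacks through page->mapping could look roughly like the user-space model below. Everything here is hypothetical and only illustrates the control flow discussed in the thread: struct mapping, struct mapping_ops, struct page_model, struct ring_owner and migrate_pinned_page() are invented names, not kernel types or functions, and no such callbacks are claimed to exist in address_space_operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these are real kernel types. */
struct pfn_pair {
    unsigned long old_pfn;
    unsigned long new_pfn;
};

struct mapping;

/* Stand-in for a per-owner operations table holding the two callbacks. */
struct mapping_ops {
    int (*migrate_unpin)(struct mapping *m, const struct pfn_pair *pairs, size_t n);
    int (*migrate_pin)(struct mapping *m, const struct pfn_pair *pairs, size_t n);
};

struct mapping {
    const struct mapping_ops *ops;
    void *private_data;
};

/* Stand-in for struct page: the owner's ops are found via page->mapping. */
struct page_model {
    unsigned long pfn;
    struct mapping *mapping;
};

/* Toy owner (think: an aio ring) that keeps exactly one page pinned. */
struct ring_owner {
    unsigned long pinned_pfn;
    int frozen; /* nonzero while the owner must not touch the page */
};

static int ring_migrate_unpin(struct mapping *m, const struct pfn_pair *pairs, size_t n)
{
    struct ring_owner *r = m->private_data;

    for (size_t i = 0; i < n; i++)
        if (pairs[i].old_pfn == r->pinned_pfn)
            r->frozen = 1; /* drop the pin and barrier further use */
    return 0;
}

static int ring_migrate_pin(struct mapping *m, const struct pfn_pair *pairs, size_t n)
{
    struct ring_owner *r = m->private_data;

    for (size_t i = 0; i < n; i++) {
        if (pairs[i].old_pfn == r->pinned_pfn) {
            assert(r->frozen); /* pin must follow a matching unpin */
            r->pinned_pfn = pairs[i].new_pfn; /* repin the new pfn */
            r->frozen = 0;
            break; /* this toy owner pins a single page */
        }
    }
    return 0;
}

const struct mapping_ops ring_ops = {
    .migrate_unpin = ring_migrate_unpin,
    .migrate_pin = ring_migrate_pin,
};

/*
 * Migration core: look up the callbacks through page->mapping, ask the
 * owner to unpin, move the page, then ask the owner to repin.
 * Returns 0 on success, -1 if the owner provides no callbacks or fails.
 */
int migrate_pinned_page(struct page_model *page, unsigned long new_pfn)
{
    struct mapping *m = page->mapping;
    struct pfn_pair pair = { page->pfn, new_pfn };

    if (!m || !m->ops || !m->ops->migrate_unpin || !m->ops->migrate_pin)
        return -1; /* owner cannot cooperate; the page stays where it is */
    if (m->ops->migrate_unpin(m, &pair, 1) != 0)
        return -1;
    page->pfn = new_pfn; /* the real data copy would happen here */
    return m->ops->migrate_pin(m, &pair, 1);
}
```

In this model an owner without callbacks makes migrate_pinned_page() fail and leaves the page untouched, which mirrors the point made earlier in the thread that the scheme depends entirely on subsystems implementing the callbacks.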
-- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754844Ab3EMOyp (ORCPT ); Mon, 13 May 2013 10:54:45 -0400 Received: from mx1.redhat.com ([209.132.183.28]:55209 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751052Ab3EMOyn (ORCPT ); Mon, 13 May 2013 10:54:43 -0400 From: Jeff Moyer To: Benjamin LaHaise Cc: Mel Gorman , Tang Chen , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> X-PGP-KeyID: 1F78E1B4 X-PGP-CertKey: F6FE 280D 8293 F72C 65FD 5A58 1FF8 A7CA 1F78 E1B4 X-PCLoadLetter: What the f**k does that mean? 
Date: Mon, 13 May 2013 10:54:03 -0400 In-Reply-To: <20130513143757.GP31899@kvack.org> (Benjamin LaHaise's message of "Mon, 13 May 2013 10:37:57 -0400") Message-ID: User-Agent: Gnus/5.110011 (No Gnus v0.11) Emacs/23.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Benjamin LaHaise writes: > On Mon, May 13, 2013 at 10:19:02AM +0100, Mel Gorman wrote: >> On Mon, May 13, 2013 at 05:11:43PM +0800, Tang Chen wrote: > ... >> > If so, I'm wondering where should we put this callback pointers ? >> > In struct page ? >> > >> >> No, I would expect the callbacks to be part the address space operations >> which can be found via page->mapping. > > If someone adds those callbacks and provides a means for testing them, > it would be pretty trivial to change the aio code to migrate its pinned > pages on demand. How do you propose to move the ring pages? Cheers, Jeff From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754868Ab3EMPBt (ORCPT ); Mon, 13 May 2013 11:01:49 -0400 Received: from kanga.kvack.org ([205.233.56.17]:53596 "EHLO kanga.kvack.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752808Ab3EMPBr (ORCPT ); Mon, 13 May 2013 11:01:47 -0400 X-Greylist: delayed 1429 seconds by postgrey-1.27 at vger.kernel.org; Mon, 13 May 2013 11:01:47 EDT Date: Mon, 13 May 2013 11:01:47 -0400 From: Benjamin LaHaise To: Jeff Moyer Cc: Mel Gorman , Tang Chen , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: 
Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130513150147.GQ31899@kvack.org> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: > How do you propose to move the ring pages? It's the same problem as doing a TLB shootdown: flush the old pages from userspace's mapping, copy any existing data to the new pages, then repopulate the page tables. It will likely require the addition of address_space_operations for the mapping, but that's not too hard to do. -ben -- "Thought is the essence of where you are now." 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754899Ab3EMPE1 (ORCPT ); Mon, 13 May 2013 11:04:27 -0400 Received: from kanga.kvack.org ([205.233.56.17]:53961 "EHLO kanga.kvack.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751697Ab3EMPEZ (ORCPT ); Mon, 13 May 2013 11:04:25 -0400 Date: Mon, 13 May 2013 10:37:57 -0400 From: Benjamin LaHaise To: Mel Gorman Cc: Tang Chen , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130513143757.GP31899@kvack.org> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130513091902.GP11497@suse.de> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, May 13, 2013 at 10:19:02AM +0100, Mel Gorman wrote: > On Mon, May 13, 2013 at 05:11:43PM +0800, Tang Chen wrote: ... > > If so, I'm wondering where should we put this callback pointers ? > > In struct page ? > > > > No, I would expect the callbacks to be part the address space operations > which can be found via page->mapping. 
If someone adds those callbacks and provides a means for testing them, it would be pretty trivial to change the aio code to migrate its pinned pages on demand. -ben -- "Thought is the essence of where you are now." From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756234Ab3ENBWS (ORCPT ); Mon, 13 May 2013 21:22:18 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:21995 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1756205Ab3ENBWM (ORCPT ); Mon, 13 May 2013 21:22:12 -0400 X-IronPort-AV: E=Sophos;i="4.87,666,1363104000"; d="scan'208";a="7260628" Message-ID: <5191926A.2090608@cn.fujitsu.com> Date: Tue, 14 May 2013 09:24:58 +0800 From: Tang Chen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Benjamin LaHaise , Jeff Moyer , Mel Gorman CC: Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> <20130513150147.GQ31899@kvack.org> In-Reply-To: <20130513150147.GQ31899@kvack.org> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/14 
09:20:57, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/14 09:20:58, Serialize complete at 2013/05/14 09:20:58 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Mel, Benjamin, Jeff, On 05/13/2013 11:01 PM, Benjamin LaHaise wrote: > On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: >> How do you propose to move the ring pages? > > It's the same problem as doing a TLB shootdown: flush the old pages from > userspace's mapping, copy any existing data to the new pages, then > repopulate the page tables. It will likely require the addition of > address_space_operations for the mapping, but that's not too hard to do. > I think we add migrate_unpin() callback to decrease page->count if necessary, and migrate the page to a new page, and add migrate_pin() callback to pin the new page again. The migrate procedure will work just as before. We use callbacks to decrease the page->count before migration starts, and increase it when the migration is done. And migrate_pin() and migrate_unpin() callbacks will be added to struct address_space_operations. Is that right ? If so, I'll be working on it. Thanks. 
:) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751427Ab3ENEM3 (ORCPT ); Tue, 14 May 2013 00:12:29 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:15616 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1750839Ab3ENEM1 (ORCPT ); Tue, 14 May 2013 00:12:27 -0400 X-IronPort-AV: E=Sophos;i="4.87,667,1363104000"; d="scan'208";a="7262498" Message-ID: <5191B5B3.7080406@cn.fujitsu.com> Date: Tue, 14 May 2013 11:55:31 +0800 From: Tang Chen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Mel Gorman CC: Minchan Kim , Lin Feng , akpm@linux-foundation.org, bcrl@kvack.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> In-Reply-To: <20130513091902.GP11497@suse.de> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/14 11:51:30, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/14 11:51:59, Serialize complete at 2013/05/14 11:51:59 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-15; format=flowed Sender: linux-kernel-owner@vger.kernel.org 
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Mel, On 05/13/2013 05:19 PM, Mel Gorman wrote: >> For memory hot-remove case, the aio pages are pined in memory and making >> the pages cannot be offlined, furthermore, the pages cannot be removed. >> >> IIUC, you mean implement migrate_unpin() and migrate_pin() callbacks in aio >> subsystem, and call them when hot-remove code tries to offline >> pages, right ? >> >> If so, I'm wondering where should we put this callback pointers ? >> In struct page ? >> > > No, I would expect the callbacks to be part the address space operations > which can be found via page->mapping. > Two more problems I don't quite understand. 1. An anonymous page has no address_space, and so no address space operations. But the aio ring problem happens precisely when dealing with anonymous pages. Please refer to: (https://lkml.org/lkml/2012/11/29/69) If we put the callbacks in page->mapping->a_ops, anonymous pages won't be able to use them. And we cannot provide a default callback, because the situation we are dealing with is a special case. So where should we put the callbacks for anonymous pages ? 2. How do we find out why page->count != 1 in migrate_page_move_mapping() ? In the problem we are dealing with, get_user_pages() is called to pin the pages in memory, and the pages are otherwise migratable, so we want to decrease page->count. But get_user_pages() is not the only thing that can raise page->count. How can I know when I should decrease the page->count and when I should not ? The only way I can figure out is to assign the callback pointer in get_user_pages(), because it is get_user_pages() that pins the pages. Thanks. 
:) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757582Ab3ENN6y (ORCPT ); Tue, 14 May 2013 09:58:54 -0400 Received: from kanga.kvack.org ([205.233.56.17]:39957 "EHLO kanga.kvack.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753439Ab3ENN6x (ORCPT ); Tue, 14 May 2013 09:58:53 -0400 Date: Tue, 14 May 2013 09:58:50 -0400 From: Benjamin LaHaise To: Tang Chen Cc: Jeff Moyer , Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Message-ID: <20130514135850.GG13845@kvack.org> References: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> <20130513150147.GQ31899@kvack.org> <5191926A.2090608@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5191926A.2090608@cn.fujitsu.com> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote: > Hi Mel, Benjamin, Jeff, > > On 05/13/2013 11:01 PM, Benjamin LaHaise wrote: > >On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: > >>How do you propose to move the ring pages? 
> > > >It's the same problem as doing a TLB shootdown: flush the old pages from > >userspace's mapping, copy any existing data to the new pages, then > >repopulate the page tables. It will likely require the addition of > >address_space_operations for the mapping, but that's not too hard to do. > > > > I think we add migrate_unpin() callback to decrease page->count if > necessary, > and migrate the page to a new page, and add migrate_pin() callback to pin > the new page again. You can't just decrease the page count for this to work. The pages are pinned because aio_complete() can occur at any time and needs to have a place to write the completion events. When changing pages, aio has to take the appropriate lock when changing one page for another. > The migrate procedure will work just as before. We use callbacks to > decrease > the page->count before migration starts, and increase it when the migration > is done. > > And migrate_pin() and migrate_unpin() callbacks will be added to > struct address_space_operations. I think the existing migratepage operation in address_space_operations can be used. Does it get called when hot unplug occurs? That is: is testing with the migrate_pages syscall similar enough to the memory removal case? -ben > Is that right ? > > If so, I'll be working on it. > > Thanks. :) -- "Thought is the essence of where you are now." 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755497Ab3EOCGX (ORCPT ); Tue, 14 May 2013 22:06:23 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:25718 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1752001Ab3EOCGV (ORCPT ); Tue, 14 May 2013 22:06:21 -0400 X-IronPort-AV: E=Sophos;i="4.87,675,1363104000"; d="scan'208";a="7271342" Message-ID: <5192EE40.7060407@cn.fujitsu.com> Date: Wed, 15 May 2013 10:09:04 +0800 From: Tang Chen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Benjamin LaHaise , Mel Gorman CC: Jeff Moyer , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> <20130513150147.GQ31899@kvack.org> <5191926A.2090608@cn.fujitsu.com> <20130514135850.GG13845@kvack.org> In-Reply-To: <20130514135850.GG13845@kvack.org> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/15 10:05:01, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/15 10:05:07, Serialize complete at 2013/05/15 10:05:07 Content-Transfer-Encoding: 7bit Content-Type: text/plain; 
charset=ISO-8859-1; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Benjamin, Mel, Please see below. On 05/14/2013 09:58 PM, Benjamin LaHaise wrote: > On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote: >> Hi Mel, Benjamin, Jeff, >> >> On 05/13/2013 11:01 PM, Benjamin LaHaise wrote: >>> On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: >>>> How do you propose to move the ring pages? >>> >>> It's the same problem as doing a TLB shootdown: flush the old pages from >>> userspace's mapping, copy any existing data to the new pages, then >>> repopulate the page tables. It will likely require the addition of >>> address_space_operations for the mapping, but that's not too hard to do. >>> >> >> I think we add migrate_unpin() callback to decrease page->count if >> necessary, >> and migrate the page to a new page, and add migrate_pin() callback to pin >> the new page again. > > You can't just decrease the page count for this to work. The pages are > pinned because aio_complete() can occur at any time and needs to have a > place to write the completion events. When changing pages, aio has to > take the appropriate lock when changing one page for another. In aio_complete(), aio_complete() { ...... spin_lock_irqsave(&ctx->completion_lock, flags); //write the completion event. spin_unlock_irqrestore(&ctx->completion_lock, flags); ...... } So for this problem, I think we can hold ctx->completion_lock in the aio callbacks to prevent aio subsystem accessing pages who are being migrated. > >> The migrate procedure will work just as before. We use callbacks to >> decrease >> the page->count before migration starts, and increase it when the migration >> is done. >> >> And migrate_pin() and migrate_unpin() callbacks will be added to >> struct address_space_operations. > > I think the existing migratepage operation in address_space_operations can > be used. Does it get called when hot unplug occurs? 
That is: is testing > with the migrate_pages syscall similar enough to the memory removal case? > But as I said, for anonymous pages such as aio ring buffer, they don't have address_space_operations. So where should we put the callbacks' pointers ? Add something like address_space_operations to struct anon_vma ? Thanks. :) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756627Ab3EOHSt (ORCPT ); Wed, 15 May 2013 03:18:49 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:47113 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1755108Ab3EOHSr (ORCPT ); Wed, 15 May 2013 03:18:47 -0400 X-IronPort-AV: E=Sophos;i="4.87,675,1363104000"; d="scan'208";a="7274242" Message-ID: <5193377E.30102@cn.fujitsu.com> Date: Wed, 15 May 2013 15:21:34 +0800 From: Tang Chen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Benjamin LaHaise , Mel Gorman CC: Jeff Moyer , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() References: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <20130513143757.GP31899@kvack.org> <20130513150147.GQ31899@kvack.org> <5191926A.2090608@cn.fujitsu.com> <20130514135850.GG13845@kvack.org> <5192EE40.7060407@cn.fujitsu.com> 
In-Reply-To: <5192EE40.7060407@cn.fujitsu.com> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/15 15:17:31, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/15 15:17:33, Serialize complete at 2013/05/15 15:17:33 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Benjamin, Mel, On 05/15/2013 10:09 AM, Tang Chen wrote: > Hi Benjamin, Mel, > > Please see below. > > On 05/14/2013 09:58 PM, Benjamin LaHaise wrote: >> On Tue, May 14, 2013 at 09:24:58AM +0800, Tang Chen wrote: >>> Hi Mel, Benjamin, Jeff, >>> >>> On 05/13/2013 11:01 PM, Benjamin LaHaise wrote: >>>> On Mon, May 13, 2013 at 10:54:03AM -0400, Jeff Moyer wrote: >>>>> How do you propose to move the ring pages? >>>> >>>> It's the same problem as doing a TLB shootdown: flush the old pages >>>> from >>>> userspace's mapping, copy any existing data to the new pages, then >>>> repopulate the page tables. It will likely require the addition of >>>> address_space_operations for the mapping, but that's not too hard to >>>> do. >>>> >>> >>> I think we add migrate_unpin() callback to decrease page->count if >>> necessary, >>> and migrate the page to a new page, and add migrate_pin() callback to >>> pin >>> the new page again. >> >> You can't just decrease the page count for this to work. The pages are >> pinned because aio_complete() can occur at any time and needs to have a >> place to write the completion events. When changing pages, aio has to >> take the appropriate lock when changing one page for another. > > In aio_complete(), > > aio_complete() { > ...... > spin_lock_irqsave(&ctx->completion_lock, flags); > //write the completion event. > spin_unlock_irqrestore(&ctx->completion_lock, flags); > ...... 
> } > > So for this problem, I think we can hold kioctx->completion_lock in the aio > callbacks to prevent aio subsystem accessing pages who are being migrated. > Another problem here is: We intend to call these callbacks in the page migration path, and we need to know which lock to hold. But there is no way for the migration path to know this. The migration path is common to all kinds of pages, so we cannot pass any subsystem-specific parameter to the callbacks there. And given only a page, we cannot get any kioctx info from it. So how can the callback know which lock to acquire without any parameter ? Or is there some other way to do this ? Would you please give some more advice about this ? BTW, we also need to update kioctx->ring_pages. Thanks. :) >> >>> The migrate procedure will work just as before. We use callbacks to >>> decrease >>> the page->count before migration starts, and increase it when the >>> migration >>> is done. >>> >>> And migrate_pin() and migrate_unpin() callbacks will be added to >>> struct address_space_operations. >> >> I think the existing migratepage operation in address_space_operations >> can >> be used. Does it get called when hot unplug occurs? That is: is testing >> with the migrate_pages syscall similar enough to the memory removal case? >> > > But as I said, for anonymous pages such as aio ring buffer, they don't have > address_space_operations. So where should we put the callbacks' pointers ? > > Add something like address_space_operations to struct anon_vma ? > > Thanks. 
:) > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755255Ab3EQAXv (ORCPT ); Thu, 16 May 2013 20:23:51 -0400 Received: from kanga.kvack.org ([205.233.56.17]:45856 "EHLO kanga.kvack.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755222Ab3EQAXu (ORCPT ); Thu, 16 May 2013 20:23:50 -0400 Date: Thu, 16 May 2013 20:23:49 -0400 From: Benjamin LaHaise To: Tang Chen Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Message-ID: <20130517002349.GI1008@kvack.org> References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5194748A.5070700@cn.fujitsu.com> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org 
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, May 16, 2013 at 01:54:18PM +0800, Tang Chen wrote: ... > OK, I'll try to figure out a proper place to put the callbacks. > But I think we need to add something new to struct page. I'm just > not sure if it is OK. Maybe we can discuss more about it when I send > a RFC patch. ... I ended up working on this a bit today, and managed to cobble together something that somewhat works -- please see the patch below. It still is not completely tested, and it has a rather nasty bug owing to the fact that the file descriptors returned by anon_inode_getfile() all share the same inode (read: more than one instance of aio does not work), but it shows the basic idea. Also, bad things probably happen if someone does an mremap() on the aio ring buffer. I'll polish this off sometime next week after the long weekend if no one beats me to it. -ben -- "Thought is the essence of where you are now."

 fs/aio.c                |   95 ++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/migrate.h |    3 +
 mm/migrate.c            |    2 -
 3 files changed, 96 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index c5b1a8c..dbad23e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -35,6 +35,9 @@
 #include
 #include
 #include
+#include
+#include
+#include
 
 #include
 #include
@@ -108,6 +111,7 @@ struct kioctx {
 	} ____cacheline_aligned_in_smp;
 
 	struct page		*internal_pages[AIO_RING_PAGES];
+	struct file		*ctx_file;
 };
 
 /*------ sysctl variables----*/
@@ -146,8 +150,59 @@ static void aio_free_ring(struct kioctx *ctx)
 
 	if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages)
 		kfree(ctx->ring_pages);
+
+	if (ctx->ctx_file) {
+		truncate_setsize(ctx->ctx_file->f_inode, 0);
+		fput(ctx->ctx_file);
+		ctx->ctx_file = NULL;
+	}
+}
+
+static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	vma->vm_ops = &generic_file_vm_ops;
+	return 0;
+}
+
+static const struct file_operations aio_ctx_fops = {
+	.mmap	= aio_ctx_mmap,
+};
+
+static int aio_set_page_dirty(struct page *page)
+{
+	return 0;
+}
+
+static int aio_migratepage(struct address_space *mapping, struct page *new,
+			   struct page *old, enum migrate_mode mode)
+{
+	struct kioctx *ctx = mapping->private_data;
+	unsigned long flags;
+	unsigned idx = old->index;
+	int rc;
+
+	BUG_ON(PageWriteback(old));    /* Writeback must be complete */
+	put_page(old);
+	rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
+	if (rc != MIGRATEPAGE_SUCCESS) {
+		get_page(old);
+		return rc;
+	}
+	get_page(new);
+
+	spin_lock_irqsave(&ctx->completion_lock, flags);
+	migrate_page_copy(new, old);
+	ctx->ring_pages[idx] = new;
+	spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+	return MIGRATEPAGE_SUCCESS;
 }
 
+static const struct address_space_operations aio_ctx_aops = {
+	.set_page_dirty	= aio_set_page_dirty,
+	.migratepage	= aio_migratepage,
+};
+
 static int aio_setup_ring(struct kioctx *ctx)
 {
 	struct aio_ring *ring;
@@ -155,6 +210,7 @@ static int aio_setup_ring(struct kioctx *ctx)
 	struct mm_struct *mm = current->mm;
 	unsigned long size, populate;
 	int nr_pages;
+	int i;
 
 	/* Compensate for the ring buffer's head/tail overlap entry */
 	nr_events += 2;	/* 1 is required, 2 for good luck */
@@ -166,6 +222,31 @@ static int aio_setup_ring(struct kioctx *ctx)
 	if (nr_pages < 0)
 		return -EINVAL;
 
+	ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR);
+	if (IS_ERR(ctx->ctx_file)) {
+		ctx->ctx_file = NULL;
+		return -EAGAIN;
+	}
+	ctx->ctx_file->f_inode->i_mapping->a_ops = &aio_ctx_aops;
+	ctx->ctx_file->f_inode->i_mapping->private_data = ctx;
+	ctx->ctx_file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages;
+
+	for (i=0; i<nr_pages; i++) {
+		struct page *page;
+		void *ptr;
+		page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping,
+					   i, GFP_KERNEL);
+		if (!page) {
+			break;
+		}
+		ptr = kmap(page);
+		clear_page(ptr);
+		kunmap(page);
+		SetPageUptodate(page);
+		SetPageDirty(page);
+		unlock_page(page);
+	}
+
 	nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event);
 
 	ctx->nr_events = 0;
@@ -180,20 +261,25 @@ static int aio_setup_ring(struct kioctx *ctx)
 	ctx->mmap_size = nr_pages * PAGE_SIZE;
 	pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
 	down_write(&mm->mmap_sem);
-	ctx->mmap_base = do_mmap_pgoff(NULL, 0, ctx->mmap_size,
+	ctx->mmap_base = do_mmap_pgoff(ctx->ctx_file, 0, ctx->mmap_size,
 				       PROT_READ|PROT_WRITE,
-				       MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate);
+				       MAP_SHARED|MAP_POPULATE, 0,
+				       &populate);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		up_write(&mm->mmap_sem);
 		ctx->mmap_size = 0;
 		aio_free_ring(ctx);
 		return -EAGAIN;
 	}
+	up_write(&mm->mmap_sem);
+	mm_populate(ctx->mmap_base, populate);
 
 	pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
 	ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
 				       1, 0, ctx->ring_pages, NULL);
-	up_write(&mm->mmap_sem);
+	for (i=0; i<ctx->nr_pages; i++) {
+		put_page(ctx->ring_pages[i]);
+	}
 
 	if (unlikely(ctx->nr_pages != nr_pages)) {
 		aio_free_ring(ctx);
@@ -403,6 +489,8 @@ out_cleanup:
 	err = -EAGAIN;
 	aio_free_ring(ctx);
 out_freectx:
+	if (ctx->ctx_file)
+		fput(ctx->ctx_file);
 	kmem_cache_free(kioctx_cachep, ctx);
 	pr_debug("error allocating ioctx %d\n", err);
 	return ERR_PTR(err);
@@ -852,6 +940,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
 	ioctx = ioctx_alloc(nr_events);
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
+		ctx = ioctx->user_id;
 		ret = put_user(ioctx->user_id, ctxp);
 		if (ret)
 			kill_ioctx(ioctx);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..b6f3289 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -55,6 +55,9 @@ extern int migrate_vmas(struct mm_struct *mm,
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
+extern int migrate_page_move_mapping(struct address_space *mapping,
+		struct page *newpage, struct page *page,
+		struct buffer_head *head, enum migrate_mode mode);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
diff --git a/mm/migrate.c b/mm/migrate.c
index 27ed225..ac9c3a9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -294,7 +294,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
  * 2 for pages with a mapping
  * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
  */
-static int migrate_page_move_mapping(struct address_space *mapping,
+int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode)
 {

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752927Ab3EQD0G (ORCPT ); Thu, 16 May 2013 23:26:06 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:28042 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1752063Ab3EQD0F (ORCPT ); Thu, 16 May 2013 23:26:05 -0400 X-IronPort-AV: E=Sophos;i="4.87,689,1363104000"; d="scan'208";a="7291138" Message-ID: <5195A3F4.70803@cn.fujitsu.com> Date: Fri, 17 May 2013 11:28:52 +0800 From: Tang Chen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Benjamin LaHaise CC: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <1360056113-14294-1-git-send-email-linfeng@cn.fujitsu.com> 
<1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> In-Reply-To: <20130517002349.GI1008@kvack.org> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/17 11:24:44, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/17 11:24:48, Serialize complete at 2013/05/17 11:24:48 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Benjamin, Thank you very much for your idea. :) I have no objection to your idea, but seeing from your patch, this only works for aio subsystem because you changed the way to allocate the aio ring pages, with a file mapping. So far as I know, not only aio, but also other subsystems, such CMA, will also have problem like this. The page cannot be migrated because it is pinned in memory. So I think we should work out a common way to solve how to migrate pinned pages. I'm working in the way Mel has said, migrate_unpin() and migrate_pin() callbacks. But as you saw, I met some problems, like I don't where to put these two callbacks. And discussed with you guys, I want to try this: 1. Add a new member to struct page, used to remember the pin holders of this page, including the pin and unpin callbacks and the necessary data. This is more like a callback chain. (I'm worry about this step, I'm not sure if it is good enough. After all, we need a good place to put the callbacks.) And then, like Mel said, 2. Implement the callbacks in the subsystems, and register them to the new member in struct page. 3. 
Call these callbacks before and after migration. I think I'll send a RFC patch next week when I finished the outline. I'm just thinking of finding a common way to solve this problem that all the other subsystems will benefit. Thanks. :) On 05/17/2013 08:23 AM, Benjamin LaHaise wrote: > On Thu, May 16, 2013 at 01:54:18PM +0800, Tang Chen wrote: > ... >> OK, I'll try to figure out a proper place to put the callbacks. >> But I think we need to add something new to struct page. I'm just >> not sure if it is OK. Maybe we can discuss more about it when I send >> a RFC patch. > ... > > I ended up working on this a bit today, and managed to cobble together > something that somewhat works -- please see the patch below. It still is > not completely tested, and it has a rather nasty bug owing to the fact > that the file descriptors returned by anon_inode_getfile() all share the > same inode (read: more than one instance of aio does not work), but it > shows the basic idea. Also, bad things probably happen if someone does > an mremap() on the aio ring buffer. I'll polish this off sometime next > week after the long weekend if noone beats me to it. 
> > -ben From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756209Ab3EQOhW (ORCPT ); Fri, 17 May 2013 10:37:22 -0400 Received: from kanga.kvack.org ([205.233.56.17]:52859 "EHLO kanga.kvack.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753532Ab3EQOhU (ORCPT ); Fri, 17 May 2013 10:37:20 -0400 Date: Fri, 17 May 2013 10:37:18 -0400 From: Benjamin LaHaise To: Tang Chen Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Message-ID: <20130517143718.GK1008@kvack.org> References: <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5195A3F4.70803@cn.fujitsu.com> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, May 17, 2013 at 11:28:52AM +0800, Tang Chen wrote: > Hi Benjamin, > > Thank you very much for your idea. 
:) > > I have no objection to your idea, but seeing from your patch, this only > works for aio subsystem because you changed the way to allocate the aio > ring pages, with a file mapping. That is correct. There is no way you're going to be able to solve this problem without dealing with the issue on a subsystem by subsystem basis. > So far as I know, not only aio, but also other subsystems, such CMA, will > also have problem like this. The page cannot be migrated because it is > pinned in memory. So I think we should work out a common way to solve how > to migrate pinned pages. A generic approach would require hardware support, but I doubt that is going to happen. > I'm working in the way Mel has said, migrate_unpin() and migrate_pin() > callbacks. But as you saw, I met some problems, like I don't where to put > these two callbacks. And discussed with you guys, I want to try this: > > 1. Add a new member to struct page, used to remember the pin holders of > this page, including the pin and unpin callbacks and the necessary data. > This is more like a callback chain. > (I'm worry about this step, I'm not sure if it is good enough. After > all, > we need a good place to put the callbacks.) Putting function pointers into struct page is not going to happen. You'd be adding a significant amount of memory overhead for something that is never going to be used on the vast majority of systems (2 function pointers would be 16 bytes per page on a 64 bit system). Keep in mind that distro kernels tend to enable almost all config options on their kernels, so the overhead of any approach has to make sense for the users of the kernel that will never make use of this kind of migration. > And then, like Mel said, > > 2. Implement the callbacks in the subsystems, and register them to the > new member in struct page. No, the hook should be in the address_space_operations. We already have a pointer to an address space in struct page. This avoids adding more overhead to struct page. 
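As a rough userspace illustration of that point (a model only, not code from the patch; every model_* name is hypothetical and the types are simplified stand-ins for struct page and struct address_space), the sketch below contrasts a shared operations table reached through the pointer each page already carries with the rejected layout that stores two callbacks in every page:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: hypothetical stand-ins, not the kernel's types. */
struct model_page;
struct model_mapping;

/* One shared table per subsystem (models address_space_operations). */
struct model_aops {
	int (*migratepage)(struct model_mapping *mapping,
			   struct model_page *newpage,
			   struct model_page *oldpage);
};

struct model_mapping {
	const struct model_aops *a_ops;	/* shared by every page of the mapping */
	void *private_data;		/* subsystem data, here an array of page pointers */
};

struct model_page {
	struct model_mapping *mapping;	/* pointer the page already carries */
	unsigned long index;
};

/* The rejected layout: two extra function pointers in every page. */
struct model_page_with_callbacks {
	struct model_page page;
	int (*migrate_pin)(struct model_page *page);
	int (*migrate_unpin)(struct model_page *page);
};

/* Extra bytes per page that the rejected layout would cost. */
static size_t model_per_page_overhead(void)
{
	return sizeof(struct model_page_with_callbacks) -
	       sizeof(struct model_page);
}

/* Dispatch through the mapping; returns -1 if the page has no hook. */
static int model_migrate(struct model_page *newpage, struct model_page *oldpage)
{
	struct model_mapping *mapping;

	assert(newpage != NULL && oldpage != NULL);
	mapping = oldpage->mapping;
	if (!mapping || !mapping->a_ops || !mapping->a_ops->migratepage)
		return -1;
	return mapping->a_ops->migratepage(mapping, newpage, oldpage);
}

/* Example hook: the subsystem repoints its own bookkeeping at the new page. */
static int model_ring_migratepage(struct model_mapping *mapping,
				  struct model_page *newpage,
				  struct model_page *oldpage)
{
	struct model_page **ring = mapping->private_data;

	newpage->mapping = mapping;
	newpage->index = oldpage->index;
	ring[oldpage->index] = newpage;
	oldpage->mapping = NULL;
	return 0;
}

static const struct model_aops model_ring_aops = {
	.migratepage = model_ring_migratepage,
};
```

On an LP64 build model_per_page_overhead() evaluates to 16, the figure mentioned earlier in this reply, while the table-based layout adds nothing per page. A caller would point a model_mapping at model_ring_aops and invoke model_migrate(&newpage, &oldpage); a page whose mapping is NULL has no hook in this model and gets -1.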
> 3. Call these callbacks before and after migration. How is that better than using the existing hook in address_space_operations? > I think I'll send a RFC patch next week when I finished the outline. I'm > just thinking of finding a common way to solve this problem that all the > other subsystems will benefit. Before pursuing this approach, make sure you've got buy-in for all of the overhead you're adding to the system. I don't think that growing struct page is going to be an acceptable design choice given the amount of overhead it will incur. > Thanks. :) Cheers, -ben -- "Thought is the essence of where you are now." From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755300Ab3EQSRe (ORCPT ); Fri, 17 May 2013 14:17:34 -0400 Received: from mx1.redhat.com ([209.132.183.28]:54582 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754708Ab3EQSRc (ORCPT ); Fri, 17 May 2013 14:17:32 -0400 Date: Fri, 17 May 2013 11:17:08 -0700 From: Zach Brown To: Benjamin LaHaise Cc: Tang Chen , Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Message-ID: <20130517181708.GG318@lenny.home.zabbo.net> References: <1360056113-14294-2-git-send-email-linfeng@cn.fujitsu.com> <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> 
<20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130517002349.GI1008@kvack.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > I ended up working on this a bit today, and managed to cobble together > something that somewhat works -- please see the patch below. Just some quick observations: > + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR); > + if (IS_ERR(ctx->ctx_file)) { > + ctx->ctx_file = NULL; > + return -EAGAIN; > + } It's too bad that aio contexts will now be accounted against the filp limits (get_empty_filp -> files_stat.max_files, etc). > + for (i=0; i<nr_pages; i++) { > + struct page *page; > + void *ptr; > + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping, > + i, GFP_KERNEL); > + if (!page) { > + break; > + } > + ptr = kmap(page); > + clear_page(ptr); > + kunmap(page); > + SetPageUptodate(page); > + SetPageDirty(page); > + unlock_page(page); > + } If they're GFP_KERNEL then you don't need to kmap them. But we probably want to allocate with GFP_HIGHUSER and then use clear_user_highpage() to zero them?
- z From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756031Ab3EQSaG (ORCPT ); Fri, 17 May 2013 14:30:06 -0400 Received: from kanga.kvack.org ([205.233.56.17]:57907 "EHLO kanga.kvack.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754050Ab3EQSaE (ORCPT ); Fri, 17 May 2013 14:30:04 -0400 Date: Fri, 17 May 2013 14:30:03 -0400 From: Benjamin LaHaise To: Zach Brown Cc: Tang Chen , Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Message-ID: <20130517183003.GL1008@kvack.org> References: <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <20130517181708.GG318@lenny.home.zabbo.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130517181708.GG318@lenny.home.zabbo.net> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, May 17, 2013 at 11:17:08AM -0700, Zach Brown wrote: > > I ended up working on this a bit today, and managed to cobble together > > something that somewhat works -- please see the patch below. 
> > Just some quick observations: > > > + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR); > > + if (IS_ERR(ctx->ctx_file)) { > > + ctx->ctx_file = NULL; > > + return -EAGAIN; > > + } > > It's too bad that aio contexts will now be accounted against the filp > limits (get_empty_filp -> files_stat.max_files, etc). Yeah, that is a downside of this approach. It would be possible to do it with only an inode/address_space, but that would mean bypassing do_mmap(), which is not worth considering. If it is really an issue, we could add a flag to bypass that limit since aio has its own. anon_inode_getfile() as it stands is a major problem. > > + for (i=0; i<nr_pages; i++) { > > + struct page *page; > > + void *ptr; > > + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping, > > + i, GFP_KERNEL); > > + if (!page) { > > + break; > > + } > > + ptr = kmap(page); > > + clear_page(ptr); > > + kunmap(page); > > + SetPageUptodate(page); > > + SetPageDirty(page); > > + unlock_page(page); > > + } > > If they're GFP_KERNEL then you don't need to kmap them. But we probably > want to allocate with GFP_HIGHUSER and then use clear_user_highpage() to > zero them? Adding __GFP_ZERO would fix that too. The next respin will include that change. I also have to properly handle the mremap() case. -ben -- "Thought is the essence of where you are now." 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756263Ab3EUCMj (ORCPT ); Mon, 20 May 2013 22:12:39 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:21334 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1750732Ab3EUCMh (ORCPT ); Mon, 20 May 2013 22:12:37 -0400 X-IronPort-AV: E=Sophos;i="4.87,711,1363104000"; d="scan'208";a="7318598" Message-ID: <519AD6F8.2070504@cn.fujitsu.com> Date: Tue, 21 May 2013 10:07:52 +0800 From: Tang Chen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Benjamin LaHaise CC: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <20130205120137.GG21389@suse.de> <20130206004234.GD11197@blaptop> <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> In-Reply-To: <20130517143718.GK1008@kvack.org> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/21 10:03:40, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/05/21 10:04:09, Serialize complete at 2013/05/21 10:04:09 
Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Benjamin, Sorry for the late. Please see below. On 05/17/2013 10:37 PM, Benjamin LaHaise wrote: > On Fri, May 17, 2013 at 11:28:52AM +0800, Tang Chen wrote: >> Hi Benjamin, >> >> Thank you very much for your idea. :) >> >> I have no objection to your idea, but seeing from your patch, this only >> works for aio subsystem because you changed the way to allocate the aio >> ring pages, with a file mapping. > > That is correct. There is no way you're going to be able to solve this > problem without dealing with the issue on a subsystem by subsystem basis. > Yes, I understand that. We need subsystem work anyway. >> I'm working in the way Mel has said, migrate_unpin() and migrate_pin() >> callbacks. But as you saw, I met some problems, like I don't where to put >> these two callbacks. And discussed with you guys, I want to try this: >> >> 1. Add a new member to struct page, used to remember the pin holders of >> this page, including the pin and unpin callbacks and the necessary data. >> This is more like a callback chain. >> (I'm worry about this step, I'm not sure if it is good enough. After >> all, >> we need a good place to put the callbacks.) > > Putting function pointers into struct page is not going to happen. You'd > be adding a significant amount of memory overhead for something that is > never going to be used on the vast majority of systems (2 function pointers > would be 16 bytes per page on a 64 bit system). Keep in mind that distro > kernels tend to enable almost all config options on their kernels, so the > overhead of any approach has to make sense for the users of the kernel that > will never make use of this kind of migration. True. But I just cannot find a place to hold the callbacks. > >> 3. Call these callbacks before and after migration. 
> > How is that better than using the existing hook in address_space_operations? I'm not saying that using two callbacks before and after migration is better. The reason I don't want to use address_space_operations is that there is no such member for anonymous pages. In your idea, using a file mapping will create an address_space_operations. But I really don't think we can modify the way memory is allocated for all the subsystems that have this problem. It may not be just aio and CMA. That means if you want to pin pages in memory, you have to use a file mapping. This makes the memory allocation more complicated, and the idea would have to be known by all the subsystem developers. Is that going to happen? I also thought about reusing one field of struct page. But as you said, there may not be many users of this functionality. Reusing a field of struct page will make things more complicated and lead to high coupling. So, how about the other idea that Mel mentioned? We create a 1-1 mapping of pinned page ranges and the pinner (subsystem callbacks and data), maybe a global list or a hash table. And then, we can find the callbacks. Thanks. 
:) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757495Ab3EUC1g (ORCPT ); Mon, 20 May 2013 22:27:36 -0400 Received: from kanga.kvack.org ([205.233.56.17]:46920 "EHLO kanga.kvack.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750804Ab3EUC1e (ORCPT ); Mon, 20 May 2013 22:27:34 -0400 Date: Mon, 20 May 2013 22:27:33 -0400 From: Benjamin LaHaise To: Tang Chen Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Message-ID: <20130521022733.GT1008@kvack.org> References: <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <519AD6F8.2070504@cn.fujitsu.com> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, May 21, 2013 at 10:07:52AM +0800, Tang Chen wrote: .... > I'm not saying using two callbacks before and after migration is better. > I don't want to use address_space_operations is because there is no such > member > for anonymous pages. 
That depends on the nature of the pinning. For the general case of get_user_pages(), you're correct that it won't work for anonymous memory. > In your idea, using a file mapping will create a > address_space_operations. But > I really don't think we can modify the way of memory allocation for all the > subsystems who has this problem. Maybe not just aio and cma. That means if > you want to pin pages in memory, you have to use a file mapping. This makes > the memory allocation more complicated. And the idea should be known by all > the subsystem developers. Is that going to happen ? Different subsystems will need to use different approaches to fixing the issue. I doubt any single approach will work for everything. > I also thought about reuse one field of struct page. But as you said, there > may not be many users of this functionality. Reusing a field of struct page > will make things more complicated and lead to high coupling. What happens when more than one subsystem tries to pin a particular page? What if it's a shared page rather than an anonymous page? > So, how about the other idea that Mel mentioned ? > > We create a 1-1 mapping of pinned page ranges and the pinner (subsystem > callbacks and data), maybe a global list or a hash table. And then, we can > find the callbacks. Maybe that is the simplest approach, but it's going to make get_user_pages() slower and more complicated (as if it wasn't already). Maybe with all the bells and whistles of per-cpu data structures and such you can make it work, but I'm pretty sure someone running the large unmentionable benchmark will complain about the performance regressions you're going to introduce. At least in the case of the AIO ring buffer, using the address_space approach doesn't introduce any new performance issues. There's also the bigger question of if you can or cannot exclude get_user_pages_fast() from this. In short: you've got a lot more work on your hands to do. > Thanks. 
:) Cheers, -ben -- "Thought is the essence of where you are now." From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754083Ab3FKJjm (ORCPT ); Tue, 11 Jun 2013 05:39:42 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:53091 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1752404Ab3FKJjk (ORCPT ); Tue, 11 Jun 2013 05:39:40 -0400 X-IronPort-AV: E=Sophos;i="4.87,844,1363104000"; d="scan'208";a="7519758" Message-ID: <51B6F107.80501@cn.fujitsu.com> Date: Tue, 11 Jun 2013 17:42:31 +0800 From: Tang Chen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1 MIME-Version: 1.0 To: Benjamin LaHaise CC: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <20130206095617.GN21389@suse.de> <5190AE4F.4000103@cn.fujitsu.com> <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> In-Reply-To: <20130521022733.GT1008@kvack.org> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/06/11 17:37:42, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 
2013/06/11 17:37:47, Serialize complete at 2013/06/11 17:37:47 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Benjamin, Are you still working on this problem ? Thanks. :) On 05/21/2013 10:27 AM, Benjamin LaHaise wrote: > On Tue, May 21, 2013 at 10:07:52AM +0800, Tang Chen wrote: > .... >> I'm not saying using two callbacks before and after migration is better. >> I don't want to use address_space_operations is because there is no such >> member >> for anonymous pages. > > That depends on the nature of the pinning. For the general case of > get_user_pages(), you're correct that it won't work for anonymous memory. > >> In your idea, using a file mapping will create a >> address_space_operations. But >> I really don't think we can modify the way of memory allocation for all the >> subsystems who has this problem. Maybe not just aio and cma. That means if >> you want to pin pages in memory, you have to use a file mapping. This makes >> the memory allocation more complicated. And the idea should be known by all >> the subsystem developers. Is that going to happen ? > > Different subsystems will need to use different approaches to fixing the > issue. I doubt any single approach will work for everything. > >> I also thought about reuse one field of struct page. But as you said, there >> may not be many users of this functionality. Reusing a field of struct page >> will make things more complicated and lead to high coupling. > > What happens when more than one subsystem tries to pin a particular page? > What if it's a shared page rather than an anonymous page? > >> So, how about the other idea that Mel mentioned ? >> >> We create a 1-1 mapping of pinned page ranges and the pinner (subsystem >> callbacks and data), maybe a global list or a hash table. And then, we can >> find the callbacks. 
> > Maybe that is the simplest approach, but it's going to make get_user_pages() > slower and more complicated (as if it wasn't already). Maybe with all the > bells and whistles of per-cpu data structures and such you can make it work, > but I'm pretty sure someone running the large unmentionable benchmark will > complain about the performance regressions you're going to introduce. At > least in the case of the AIO ring buffer, using the address_space approach > doesn't introduce any new performance issues. There's also the bigger > question of if you can or cannot exclude get_user_pages_fast() from this. > In short: you've got a lot more work on your hands to do. > >> Thanks. :) > > Cheers, > > -ben From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754198Ab3FKOp2 (ORCPT ); Tue, 11 Jun 2013 10:45:28 -0400 Received: from kanga.kvack.org ([205.233.56.17]:40085 "EHLO kanga.kvack.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752057Ab3FKOp0 (ORCPT ); Tue, 11 Jun 2013 10:45:26 -0400 Date: Tue, 11 Jun 2013 10:45:25 -0400 From: Benjamin LaHaise To: Tang Chen Cc: Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Message-ID: <20130611144525.GB14404@kvack.org> References: <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> 
<20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51B6F107.80501@cn.fujitsu.com> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Tang, On Tue, Jun 11, 2013 at 05:42:31PM +0800, Tang Chen wrote: > Hi Benjamin, > > Are you still working on this problem ? > > Thanks. :) Below is a copy of the most recent version of this patch I have worked on. This version works and stands up to my testing using move_pages() to force the migration of the aio ring buffer. A test program is available at http://www.kvack.org/~bcrl/aio/aio-numa-test.c . Please note that this version is not suitable for mainline as the modifications to the anon inode code are undesirable, so that part needs reworking. 
-ben fs/aio.c | 113 ++++++++++++++++++++++++++++++++++++++++++++---- fs/anon_inodes.c | 14 ++++- include/linux/migrate.h | 3 + mm/migrate.c | 2 mm/swap.c | 1 5 files changed, 121 insertions(+), 12 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index c5b1a8c..a951690 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -35,6 +35,9 @@ #include #include #include +#include +#include +#include #include #include @@ -108,6 +111,7 @@ struct kioctx { } ____cacheline_aligned_in_smp; struct page *internal_pages[AIO_RING_PAGES]; + struct file *ctx_file; }; /*------ sysctl variables----*/ @@ -136,18 +140,80 @@ __initcall(aio_setup); static void aio_free_ring(struct kioctx *ctx) { - long i; - - for (i = 0; i < ctx->nr_pages; i++) - put_page(ctx->ring_pages[i]); + int i; if (ctx->mmap_size) vm_munmap(ctx->mmap_base, ctx->mmap_size); + if (ctx->ctx_file) + truncate_setsize(ctx->ctx_file->f_inode, 0); + + for (i = 0; i < ctx->nr_pages; i++) { + pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, + page_count(ctx->ring_pages[i])); + put_page(ctx->ring_pages[i]); + } + if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) kfree(ctx->ring_pages); + + if (ctx->ctx_file) { + truncate_setsize(ctx->ctx_file->f_inode, 0); + pr_debug("pid(%d) i_nlink=%u d_count=%d, d_unhashed=%d i_count=%d\n", + current->pid, ctx->ctx_file->f_inode->i_nlink, + ctx->ctx_file->f_path.dentry->d_count, + d_unhashed(ctx->ctx_file->f_path.dentry), + atomic_read(&ctx->ctx_file->f_path.dentry->d_inode->i_count)); + fput(ctx->ctx_file); + ctx->ctx_file = NULL; + } +} + +static int aio_ctx_mmap(struct file *file, struct vm_area_struct *vma) +{ + vma->vm_ops = &generic_file_vm_ops; + return 0; +} + +static const struct file_operations aio_ctx_fops = { + .mmap = aio_ctx_mmap, +}; + +static int aio_set_page_dirty(struct page *page) +{ + return 0; +} + +static int aio_migratepage(struct address_space *mapping, struct page *new, + struct page *old, enum migrate_mode mode) +{ + struct kioctx *ctx = mapping->private_data; 
+ unsigned long flags; + unsigned idx = old->index; + int rc; + + BUG_ON(PageWriteback(old)); /* Writeback must be complete */ + put_page(old); + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode); + if (rc != MIGRATEPAGE_SUCCESS) { + get_page(old); + return rc; + } + get_page(new); + + spin_lock_irqsave(&ctx->completion_lock, flags); + migrate_page_copy(new, old); + ctx->ring_pages[idx] = new; + spin_unlock_irqrestore(&ctx->completion_lock, flags); + + return MIGRATEPAGE_SUCCESS; } +static const struct address_space_operations aio_ctx_aops = { + .set_page_dirty = aio_set_page_dirty, + .migratepage = aio_migratepage, +}; + static int aio_setup_ring(struct kioctx *ctx) { struct aio_ring *ring; @@ -155,6 +221,7 @@ static int aio_setup_ring(struct kioctx *ctx) struct mm_struct *mm = current->mm; unsigned long size, populate; int nr_pages; + int i; /* Compensate for the ring buffer's head/tail overlap entry */ nr_events += 2; /* 1 is required, 2 for good luck */ @@ -166,6 +233,28 @@ static int aio_setup_ring(struct kioctx *ctx) if (nr_pages < 0) return -EINVAL; + ctx->ctx_file = anon_inode_getfile("[aio]", &aio_ctx_fops, ctx, O_RDWR); + if (IS_ERR(ctx->ctx_file)) { + ctx->ctx_file = NULL; + return -EAGAIN; + } + ctx->ctx_file->f_inode->i_mapping->a_ops = &aio_ctx_aops; + ctx->ctx_file->f_inode->i_mapping->private_data = ctx; + ctx->ctx_file->f_inode->i_size = PAGE_SIZE * (loff_t)nr_pages; + + for (i=0; i<nr_pages; i++) { + struct page *page; + page = find_or_create_page(ctx->ctx_file->f_inode->i_mapping, + i, GFP_HIGHUSER | __GFP_ZERO); + if (!page) + break; + pr_debug("pid(%d) page[%d]->count=%d\n", + current->pid, i, page_count(page)); + SetPageUptodate(page); + SetPageDirty(page); + unlock_page(page); + } + nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) / sizeof(struct io_event); ctx->nr_events = 0; @@ -180,20 +269,25 @@ static int aio_setup_ring(struct kioctx *ctx) ctx->mmap_size = nr_pages * PAGE_SIZE; pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size); down_write(&mm->mmap_sem); - ctx->mmap_base = 
do_mmap_pgoff(NULL, 0, ctx->mmap_size, - PROT_READ|PROT_WRITE, - MAP_ANONYMOUS|MAP_PRIVATE, 0, &populate); + ctx->mmap_base = do_mmap_pgoff(ctx->ctx_file, 0, ctx->mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, 0, + &populate); if (IS_ERR((void *)ctx->mmap_base)) { up_write(&mm->mmap_sem); ctx->mmap_size = 0; aio_free_ring(ctx); return -EAGAIN; } + up_write(&mm->mmap_sem); + mm_populate(ctx->mmap_base, populate); pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages, 1, 0, ctx->ring_pages, NULL); - up_write(&mm->mmap_sem); + for (i=0; i<ctx->nr_pages; i++) { + put_page(ctx->ring_pages[i]); + } if (unlikely(ctx->nr_pages != nr_pages)) { aio_free_ring(ctx); @@ -403,6 +497,8 @@ out_cleanup: err = -EAGAIN; aio_free_ring(ctx); out_freectx: + if (ctx->ctx_file) + fput(ctx->ctx_file); kmem_cache_free(kioctx_cachep, ctx); pr_debug("error allocating ioctx %d\n", err); return ERR_PTR(err); @@ -852,6 +948,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) ioctx = ioctx_alloc(nr_events); ret = PTR_ERR(ioctx); if (!IS_ERR(ioctx)) { + ctx = ioctx->user_id; ret = put_user(ioctx->user_id, ctxp); if (ret) kill_ioctx(ioctx); diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 47a65df..376d289 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -131,6 +131,7 @@ struct file *anon_inode_getfile(const char *name, struct qstr this; struct path path; struct file *file; + struct inode *inode; if (IS_ERR(anon_inode_inode)) return ERR_PTR(-ENODEV); @@ -138,6 +139,12 @@ struct file *anon_inode_getfile(const char *name, if (fops->owner && !try_module_get(fops->owner)) return ERR_PTR(-ENOENT); + inode = anon_inode_mkinode(anon_inode_inode->i_sb); + if (IS_ERR(inode)) { + file = ERR_PTR(-ENOMEM); + goto err_module; + } + /* * Link the inode to a directory entry by creating a unique name * using the inode sequence number. 
@@ -155,17 +162,18 @@ struct file *anon_inode_getfile(const char *name, * We know the anon_inode inode count is always greater than zero, * so ihold() is safe. */ - ihold(anon_inode_inode); + //ihold(inode); - d_instantiate(path.dentry, anon_inode_inode); + d_instantiate(path.dentry, inode); file = alloc_file(&path, OPEN_FMODE(flags), fops); if (IS_ERR(file)) goto err_dput; - file->f_mapping = anon_inode_inode->i_mapping; + file->f_mapping = inode->i_mapping; file->f_flags = flags & (O_ACCMODE | O_NONBLOCK); file->private_data = priv; + drop_nlink(inode); return file; diff --git a/include/linux/migrate.h b/include/linux/migrate.h index a405d3dc..b6f3289 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -55,6 +55,9 @@ extern int migrate_vmas(struct mm_struct *mm, extern void migrate_page_copy(struct page *newpage, struct page *page); extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); +extern int migrate_page_move_mapping(struct address_space *mapping, + struct page *newpage, struct page *page, + struct buffer_head *head, enum migrate_mode mode); #else static inline void putback_lru_pages(struct list_head *l) {} diff --git a/mm/migrate.c b/mm/migrate.c index 27ed225..ac9c3a9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -294,7 +294,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head, * 2 for pages with a mapping * 3 for pages with a mapping and PagePrivate/PagePrivate2 set. 
*/ -static int migrate_page_move_mapping(struct address_space *mapping, +int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, struct buffer_head *head, enum migrate_mode mode) { diff --git a/mm/swap.c b/mm/swap.c index dfd7d71..bbfba0a 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -160,6 +160,7 @@ skip_lock_tail: void put_page(struct page *page) { + BUG_ON(page_count(page) <= 0); if (unlikely(PageCompound(page))) put_compound_page(page); else if (put_page_testzero(page)) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755191Ab3F1J1j (ORCPT ); Fri, 28 Jun 2013 05:27:39 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:53086 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1755043Ab3F1J1g (ORCPT ); Fri, 28 Jun 2013 05:27:36 -0400 X-IronPort-AV: E=Sophos;i="4.87,957,1363104000"; d="scan'208";a="7716752" Message-ID: <51CD5649.8040408@cn.fujitsu.com> Date: Fri, 28 Jun 2013 17:24:25 +0800 From: Gu Zheng User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20110930 Thunderbird/7.0.1 MIME-Version: 1.0 To: Benjamin LaHaise CC: Tang Chen , Mel Gorman , Minchan Kim , Lin Feng , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> 
<20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> In-Reply-To: <20130611144525.GB14404@kvack.org> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/06/28 17:26:10, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/06/28 17:26:13, Serialize complete at 2013/06/28 17:26:13 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/11/2013 10:45 PM, Benjamin LaHaise wrote: > Hi Tang, > > On Tue, Jun 11, 2013 at 05:42:31PM +0800, Tang Chen wrote: >> Hi Benjamin, >> >> Are you still working on this problem ? >> >> Thanks. :) > > Below is a copy of the most recent version of this patch I have worked > on. This version works and stands up to my testing using move_pages() to > force the migration of the aio ring buffer. A test program is available > at http://www.kvack.org/~bcrl/aio/aio-numa-test.c . Please note that > this version is not suitable for mainline as the modifications to the > anon inode code are undesirable, so that part needs reworking. Hi Ben, Are you still working on this patch? As you know, with the current anon inode, more than one instance of aio cannot work. Have you found a way to fix this issue? Or can we use something else in place of the anon inode? 
Thanks, Gu From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752841Ab3GAH06 (ORCPT ); Mon, 1 Jul 2013 03:26:58 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:54652 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1752667Ab3GAH04 (ORCPT ); Mon, 1 Jul 2013 03:26:56 -0400 X-IronPort-AV: E=Sophos;i="4.87,972,1363104000"; d="scan'208";a="7745757" Message-ID: <51D12E7B.6080301@cn.fujitsu.com> Date: Mon, 01 Jul 2013 15:23:39 +0800 From: Gu Zheng User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20110930 Thunderbird/7.0.1 MIME-Version: 1.0 To: Benjamin LaHaise CC: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, 
Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <20130513091902.GP11497@suse.de> <5191B5B3.7080406@cn.fujitsu.com> <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> In-Reply-To: <20130611144525.GB14404@kvack.org> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/07/01 15:25:24, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/07/01 15:25:28, Serialize complete at 2013/07/01 15:25:28 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/11/2013 10:45 PM, Benjamin LaHaise wrote: > Hi Tang, > > On Tue, Jun 11, 2013 at 05:42:31PM +0800, Tang Chen wrote: >> Hi Benjamin, >> >> Are you still working on this problem ? >> >> Thanks. :) > > Below is a copy of the most recent version of this patch I have worked > on. This version works and stands up to my testing using move_pages() to > force the migration of the aio ring buffer. A test program is available > at http://www.kvack.org/~bcrl/aio/aio-numa-test.c . Please note that > this version is not suitable for mainline as the modifications to the > anon inode code are undesirable, so that part needs reworking. Hi Ben, Are you still working on this patch? As you know, with the current anon inode, more than one instance of aio cannot work. Have you found a way to fix this issue? Or can we use something else in place of the anon inode? 
Thanks, Gu From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753497Ab3GBSAN (ORCPT ); Tue, 2 Jul 2013 14:00:13 -0400 Received: from kanga.kvack.org ([205.233.56.17]:33462 "EHLO kanga.kvack.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752060Ab3GBSAL (ORCPT ); Tue, 2 Jul 2013 14:00:11 -0400 Date: Tue, 2 Jul 2013 14:00:08 -0400 From: Benjamin LaHaise To: Gu Zheng Cc: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Message-ID: 
<20130702180008.GQ16399@kvack.org> References: <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51D12E7B.6080301@cn.fujitsu.com> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote: > Hi Ben, > Are you still working on this patch? > As you know, with the current anon inode, more than one instance of > aio cannot work. Have you found a way to fix this issue? Or can we use > something else in place of the anon inode? This patch hasn't been a high priority for me. I would really appreciate it if someone could confirm that this patch does indeed fix the hotplug page migration issue by testing it in a system that hits the bug. Removing the anon_inode bits isn't too much work, but I'd just like to have some confirmation that this fix is considered to be "good enough" for the problem at hand before spending any further time on it. There was talk of using another approach, but it's not clear if there was any progress. -ben -- "Thought is the essence of where you are now." 
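For anyone wanting to run the kind of confirmation requested above, the following is a minimal user-space sketch of a move_pages()-based check, not Ben's aio-numa-test.c. It rests on one assumption that the thread does not state: that the aio_context_t returned by io_setup() is the user-space address of the ring mapping (the kernel of this era sets user_id to mmap_base). The names classify_status, try_move_ring_page and check_ring_migration are made up for illustration, and reading -EBUSY as "still pinned" is only a heuristic based on the move_pages(2) man page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>	/* aio_context_t */

#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)	/* same value as in <linux/mempolicy.h> */
#endif

/* Hypothetical verdicts for one move_pages() status entry. */
enum ring_verdict { RING_ON_TARGET, RING_BUSY, RING_OTHER };

/*
 * A status entry holds the node the page now lives on, or a negative
 * errno. -EBUSY is documented as "page is busy or another subsystem
 * holds a reference", which is what a pinned ring page would look like.
 */
enum ring_verdict classify_status(int status, int target_node)
{
	if (status == target_node)
		return RING_ON_TARGET;
	if (status == -EBUSY)
		return RING_BUSY;
	return RING_OTHER;
}

/*
 * Ask the kernel to move the first ring page to target_node.
 * Returns 0 if the syscall itself ran (result is in *status),
 * or a negative errno if it failed outright.
 * ASSUMPTION: ctx is the user address of the ring mapping.
 */
int try_move_ring_page(aio_context_t ctx, int target_node, int *status)
{
	void *page = (void *)(unsigned long)ctx;
	long rc;

	assert(status != NULL);
	rc = syscall(SYS_move_pages, 0L, 1UL, &page, &target_node,
		     status, (long)MPOL_MF_MOVE);
	return rc < 0 ? -errno : 0;
}

/*
 * Create a context, attempt the move, tear the context down.
 * Returns a ring_verdict value, or a negative errno on syscall failure.
 */
int check_ring_migration(unsigned nr_events, int target_node)
{
	aio_context_t ctx = 0;
	int status = 0;
	int rc;

	if (syscall(SYS_io_setup, (unsigned long)nr_events, &ctx) < 0)
		return -errno;
	rc = try_move_ring_page(ctx, target_node, &status);
	syscall(SYS_io_destroy, ctx);
	return rc < 0 ? rc : (int)classify_status(status, target_node);
}
```

On a NUMA machine one would call check_ring_migration(128, node) with a node other than the one the ring currently sits on; on an unpatched kernel the expectation would be RING_BUSY, and with the patch RING_ON_TARGET. This only exercises move_pages() and says nothing by itself about the memory hot-remove path.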
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754738Ab3GCB4z (ORCPT ); Tue, 2 Jul 2013 21:56:55 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:63424 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1752552Ab3GCB4x (ORCPT ); Tue, 2 Jul 2013 21:56:53 -0400 X-IronPort-AV: E=Sophos;i="4.87,984,1363104000"; d="scan'208";a="7766396" Message-ID: <51D3841D.9040906@cn.fujitsu.com> Date: Wed, 03 Jul 2013 09:53:33 +0800 From: Gu Zheng User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20110930 Thunderbird/7.0.1 MIME-Version: 1.0 To: Benjamin LaHaise CC: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org> In-Reply-To: <20130702180008.GQ16399@kvack.org> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/07/03 09:55:18, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/07/03 09:55:24, Serialize complete at 2013/07/03 09:55:24 
Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/03/2013 02:00 AM, Benjamin LaHaise wrote: > On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote: >> Hi Ben, >> Are you still working on this patch? >> As you know, using the current anon inode will lead to more than one instance of >> aio can not work. Have you found a way to fix this issue? Or can we use some >> other ones to replace the anon inode? > > This patch hasn't been a high priority for me. I would really appreciate > it if someone could confirm that this patch does indeed fix the hotplug > page migration issue by testing it in a system that hits the bug. Removing > the anon_inode bits isn't too much work, but I'd just like to have some > confirmation that this fix is considered to be "good enough" for the > problem at hand before spending any further time on it. There was talk of > using another approach, but it's not clear if there was any progress. Yeah, we have not seen anyone try to fix this issue using the other approach we talked about. 
I'm not sure whether your patch does indeed fix the problem, but I'll run a complete test to confirm it, and I'd be glad to continue this work based on your patch if you do not have enough time to work on it. :) Thanks, Gu > > -ben From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964818Ab3GDGyp (ORCPT ); Thu, 4 Jul 2013 02:54:45 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:3575 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S932805Ab3GDGyl (ORCPT ); Thu, 4 Jul 2013 02:54:41 -0400 X-IronPort-AV: E=Sophos;i="4.87,992,1363104000"; d="scan'208";a="7780569" Message-ID: <51D51B66.3000301@cn.fujitsu.com> Date: Thu, 04 Jul 2013 14:51:18 +0800 From: Gu Zheng User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20110930 Thunderbird/7.0.1 MIME-Version: 1.0 To: Benjamin LaHaise CC: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <20130515132453.GB11497@suse.de> <5194748A.5070700@cn.fujitsu.com> <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org> In-Reply-To: <20130702180008.GQ16399@kvack.org> X-MIMETrack: 
Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/07/04 14:53:02, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/07/04 14:53:08, Serialize complete at 2013/07/04 14:53:08 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/03/2013 02:00 AM, Benjamin LaHaise wrote: > On Mon, Jul 01, 2013 at 03:23:39PM +0800, Gu Zheng wrote: >> Hi Ben, >> Are you still working on this patch? >> As you know, using the current anon inode will lead to more than one instance of >> aio can not work. Have you found a way to fix this issue? Or can we use some >> other ones to replace the anon inode? > > This patch hasn't been a high priority for me. I would really appreciate > it if someone could confirm that this patch does indeed fix the hotplug > page migration issue by testing it in a system that hits the bug. Removing > the anon_inode bits isn't too much work, but I'd just like to have some > confirmation that this fix is considered to be "good enough" for the > problem at hand before spending any further time on it. There was talk of > using another approach, but it's not clear if there was any progress. Hi Ben, When I test your patch on kernel 3.10, the kernel panics when an aio job completes or exits, specifically in aio_free_ring(). The following is part of the dmesg output. Thanks, Gu kernel BUG at mm/swap.c:163! 
invalid opcode: 0000 [#1] SMP Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF 3.10.0-aio-migrate+ #107 Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 Workqueue: events kill_ioctx_work task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 RIP: 0010:[] [] put_page+0x48/0x60 RSP: 0018:ffff8807dda99cd8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8807be1f1e00 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea001b196c80 RBP: ffff8807dda99cd8 R08: 0000000000000000 R09: 0000000000000000 R10: ffff8807ffbb5f00 R11: 000000000000005a R12: 0000000000000001 R13: 0000000000000000 R14: ffff8807dda974e0 R15: ffff8807be1f1ec8 FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003b826dc7d0 CR3: 0000000001a0b000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8807dda99d18 ffffffff811b11f6 0000000000000000 0000000200000000 ffff8807be1f1e00 ffff8807be1f1e80 000000000000000c 0000000000000000 ffff8807dda99dc8 ffffffff811b21a2 00000001000438ec 
ffff8807fd692d00 Call Trace: [] aio_free_ring+0x96/0x1c0 [] free_ioctx+0x1f2/0x250 [] ? idle_balance+0xed/0x140 [] put_ioctx+0x1a/0x30 [] kill_ioctx_work+0x2f/0x40 [] process_one_work+0x183/0x490 [] worker_thread+0x120/0x3a0 [] ? manage_workers+0x160/0x160 [] kthread+0xce/0xe0 [] ? kthread_freezable_should_stop+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? kthread_freezable_should_stop+0x70/0x70 Code: 07 00 c0 75 1f f0 ff 4f 1c 0f 94 c0 84 c0 75 0b c9 66 90 c3 0f 1f 80 00 00 00 00 e8 53 fe ff ff c9 66 90 c3 e8 7a fe ff ff c9 c3 <0f> 0b 66 0f 1f 44 00 00 eb f8 48 8b 47 30 eb bc 0f 1f 84 00 00 RIP [] put_page+0x48/0x60 RSP ---[ end trace b5e2c17407c840d8 ]--- Jul 4 15:49:50 BUG: unable to handle kernel paging request at ffffffffffffffd8 IP: [] kthread_data+0x10/0x20 PGD 1a0c067 PUD 1a0e067 PMD 0 Oops: 0000 [#2] SMP Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF D 3.10.0-aio-migrate+ #107 Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 RIP: 0010:[] [] kthread_data+0x10/0x20 RSP: 0018:ffff8807dda999b8 EFLAGS: 00010092 RAX: 0000000000000000 RBX: 0000000000000004 RCX: ffffffff81da3ea0 RDX: 0000000000000000 RSI: 
0000000000000004 RDI: ffff8807dda974e0 RBP: ffff8807dda999b8 R08: ffff8807dda97550 R09: 0000000000000006 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004 R13: ffff8807dda97ab8 R14: 0000000000000001 R15: 0000000000000006 FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000001a0b000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8807dda999d8 ffffffff8105e155 ffff8807dda999d8 ffff8807fd692d00 ffff8807dda99a68 ffffffff8154168b ffff8807dda99fd8 0000000000012d00 ffff8807dda98010 0000000000012d00 0000000000012d00 0000000000012d00 Call Trace: [] wq_worker_sleeping+0x15/0xa0 [] __schedule+0x5ab/0x6f0 [] ? put_io_context_active+0xc2/0xf0 [] schedule+0x29/0x70 [] do_exit+0x2d5/0x480 [] oops_end+0xa9/0xf0 [] die+0x5b/0x90 [] do_trap+0xcb/0x170 [] ? __atomic_notifier_call_chain+0x12/0x20 [] do_invalid_op+0x95/0xb0 [] ? put_page+0x48/0x60 [] ? truncate_inode_pages_range+0x201/0x500 [] invalid_op+0x18/0x20 [] ? put_page+0x48/0x60 [] ? truncate_setsize+0x19/0x20 [] aio_free_ring+0x96/0x1c0 [] free_ioctx+0x1f2/0x250 [] ? idle_balance+0xed/0x140 [] put_ioctx+0x1a/0x30 [] kill_ioctx_work+0x2f/0x40 [] process_one_work+0x183/0x490 [] worker_thread+0x120/0x3a0 [] ? manage_workers+0x160/0x160 [] kthread+0xce/0xe0 [] ? kthread_freezable_should_stop+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? kthread_freezable_should_stop+0x70/0x70 Code: 80 05 00 00 48 8b 40 c8 c9 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 80 05 00 00 <48> 8b 40 d8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 RIP [] kthread_data+0x10/0x20 RSP CR2: ffffffffffffffd8 ---[ end trace b5e2c17407c840d9 ]--- DP kernel: -----Fixing recursive fault but reboot is needed! 
-------[ cut here ]------------ Jul 4 15:49:50 DP kernel: kernel BUG at mm/swap.c:163! Jul 4 15:49:50 DP kernel: invalid opcode: 0000 [#1] SMP Jul 4 15:49:50 DP kernel: Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc cpufreq_ondemand ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core ioatdma i7core_edac edac_core e1000e igb dca i2c_algo_bit i2c_core ptp pps_core ext4(F) jbd2(F) mbcache(F) sd_mod(F) crc_t10dif(F) megaraid_sas(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) Jul 4 15:49:50 DP kernel: CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF 3.10.0-aio-migrate+ #107 Jul 4 15:49:50 DP kernel: Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 Jul 4 15:49:50 DP kernel: Workqueue: events kill_ioctx_work Jul 4 15:49:50 DP kernel: task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 Jul 4 15:49:50 DP kernel: RIP: 0010:[] [] put_page+0x48/0x60 Jul 4 15:49:50 DP kernel: RSP: 0018:ffff8807dda99cd8 EFLAGS: 00010246 Jul 4 15:49:50 DP kernel: RAX: 0000000000000000 RBX: ffff8807be1f1e00 RCX: 0000000000000001 Jul 4 15:49:50 DP kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffea001b196c80 Jul 4 15:49:50 DP kernel: RBP: ffff8807dda99cd8 R08: 0000000000000000 R09: 0000000000000000 Jul 4 15:49:50 DP kernel: R10: ffff8807ffbb5f00 R11: 000000000000005a R12: 0000000000000001 Jul 4 15:49:50 DP kernel: R13: 0000000000000000 R14: ffff8807dda974e0 R15: ffff8807be1f1ec8 Jul 4 15:49:50 DP kernel: FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) 
knlGS:0000000000000000 Jul 4 15:49:50 DP kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jul 4 15:49:50 DP kernel: CR2: 0000003b826dc7d0 CR3: 0000000001a0b000 CR4: 00000000000007e0 Jul 4 15:49:50 DP kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 4 15:49:50 DP kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jul 4 15:49:50 DP kernel: Stack: Jul 4 15:49:50 DP kernel: ffff8807dda99d18 ffffffff811b11f6 0000000000000000 0000000200000000 Jul 4 15:49:50 DP kernel: ffff8807be1f1e00 ffff8807be1f1e80 000000000000000c 0000000000000000 Jul 4 15:49:50 DP kernel: ffff8807dda99dc8 ffffffff811b21a2 00000001000438ec ffff8807fd692d00 Jul 4 15:49:50 DP kernel: Call Trace: Jul 4 15:49:50 DP kernel: [] aio_free_ring+0x96/0x1c0 Jul 4 15:49:50 DP kernel: [] free_ioctx+0x1f2/0x250 Jul 4 15:49:50 DP kernel: [] ? idle_balance+0xed/0x140 Jul 4 15:49:50 DP kernel: [] put_ioctx+0x1a/0x30 Jul 4 15:49:50 DP kernel: [] kill_ioctx_work+0x2f/0x40 Jul 4 15:49:50 DP kernel: [] process_one_work+0x183/0x490 Jul 4 15:49:50 DP kernel: [] worker_thread+0x120/0x3a0 Jul 4 15:49:50 DP kernel: [] ? manage_workers+0x160/0x160 Jul 4 15:49:50 DP kernel: [] kthread+0xce/0xe0 Jul 4 15:49:50 DP kernel: [] ? kthread_freezable_should_stop+0x70/0x70 Jul 4 15:49:50 DP kernel: [] ret_from_fork+0x7c/0xb0 Jul 4 15:49:50 DP kernel: [] ? 
kthread_freezable_should_stop+0x70/0x70 Jul 4 15:49:50 DP kernel: Code: 07 00 c0 75 1f f0 ff 4f 1c 0f 94 c0 84 c0 75 0b c9 66 90 c3 0f 1f 80 00 00 00 00 e8 53 fe ff ff c9 66 90 c3 e8 7a fe ff ff c9 c3 <0f> 0b 66 0f 1f 44 00 00 eb f8 48 8b 47 30 eb bc 0f 1f 84 00 00 Jul 4 15:49:50 DP kernel: RIP [] put_page+0x48/0x60 Jul 4 15:49:50 DP kernel: RSP Jul 4 15:49:50 DP kernel: ---[ end trace b5e2c17407c840d8 ]--- INFO: rcu_sched detected stalls on CPUs/tasks: { 4} (detected by 9, t=21056 jiffies, g=4158, c=4157, q=1040) sending NMI to all CPUs: NMI backtrace for cpu 4 CPU: 4 PID: 100 Comm: kworker/4:1 Tainted: GF D 3.10.0-aio-migrate+ #107 Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 89.32 DP Proto 08/16/2012 task: ffff8807dda974e0 ti: ffff8807dda98000 task.ti: ffff8807dda98000 RIP: 0010:[] [] _raw_spin_lock_irq+0x22/0x30 RSP: 0018:ffff8807dda99618 EFLAGS: 00000002 RAX: 000000000000497c RBX: ffff8807fd692d00 RCX: ffff8807dda98010 RDX: 000000000000497e RSI: ffffffff815419a9 RDI: ffff8807fd692d00 RBP: ffff8807dda99618 R08: 0000000000000004 R09: 0000000000000100 R10: 00000000000009fe R11: 00000000000009fe R12: 0000000000000004 R13: 0000000000000009 R14: 0000000000000009 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8807fd680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000001a0b000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8807dda996a8 ffffffff815411b6 ffff8807dda99fd8 0000000000012d00 ffff8807dda98010 0000000000012d00 0000000000012d00 0000000000012d00 ffff8807dda99fd8 0000000000012d00 ffff8807dda974e0 ffff8807dda996c8 Call Trace: [] __schedule+0xd6/0x6f0 [] schedule+0x29/0x70 [] do_exit+0x42a/0x480 [] oops_end+0xa9/0xf0 [] no_context+0x11e/0x1f0 [] __bad_area_nosemaphore+0x11d/0x220 [] bad_area_nosemaphore+0x13/0x20 [] 
__do_page_fault+0xc5/0x490 [] ? call_rcu_sched+0x17/0x20 [] ? strlcpy+0x4a/0x60 [] do_page_fault+0xe/0x10 [] page_fault+0x22/0x30 [] ? kthread_data+0x10/0x20 [] wq_worker_sleeping+0x15/0xa0 [] __schedule+0x5ab/0x6f0 [] ? put_io_context_active+0xc2/0xf0 [] schedule+0x29/0x70 [] do_exit+0x2d5/0x480 [] oops_end+0xa9/0xf0 [] die+0x5b/0x90 [] do_trap+0xcb/0x170 [] ? __atomic_notifier_call_chain+0x12/0x20 [] do_invalid_op+0x95/0xb0 [] ? put_page+0x48/0x60 [] ? truncate_inode_pages_range+0x201/0x500 [] invalid_op+0x18/0x20 [] ? put_page+0x48/0x60 [] ? truncate_setsize+0x19/0x20 [] aio_free_ring+0x96/0x1c0 [] free_ioctx+0x1f2/0x250 [] ? idle_balance+0xed/0x140 [] put_ioctx+0x1a/0x30 [] kill_ioctx_work+0x2f/0x40 [] process_one_work+0x183/0x490 [] worker_thread+0x120/0x3a0 [] ? manage_workers+0x160/0x160 [] kthread+0xce/0xe0 [] ? kthread_freezable_should_stop+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? kthread_freezable_should_stop+0x70/0x70 Code: 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 fa b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0d 0f 1f 00 f3 90 <0f> b7 07 66 39 c2 75 f6 c9 c3 0f 1f 40 00 55 48 89 e5 66 66 66 NMI backtrace for cpu 1 > > -ben From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756573Ab3GDLl5 (ORCPT ); Thu, 4 Jul 2013 07:41:57 -0400 Received: from kanga.kvack.org ([205.233.56.17]:51070 "EHLO kanga.kvack.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752996Ab3GDLlz (ORCPT ); Thu, 4 Jul 2013 07:41:55 -0400 Date: Thu, 4 Jul 2013 07:41:53 -0400 From: Benjamin LaHaise To: Gu Zheng Cc: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, 
linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) Message-ID: <20130704114153.GD11006@kvack.org> References: <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org> <51D51B66.3000301@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51D51B66.3000301@cn.fujitsu.com> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jul 04, 2013 at 02:51:18PM +0800, Gu Zheng wrote: > Hi Ben, > When I test your patch on kernel 3.10, the kernel panics when an aio job > completes or exits, specifically in aio_free_ring(). The following is part of > the dmesg output. What is your test case? 
-ben From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933799Ab3GEDYV (ORCPT ); Thu, 4 Jul 2013 23:24:21 -0400 Received: from cn.fujitsu.com ([222.73.24.84]:63084 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S932143Ab3GEDYU (ORCPT ); Thu, 4 Jul 2013 23:24:20 -0400 X-IronPort-AV: E=Sophos;i="4.87,999,1363104000"; d="scan'208";a="7788779" Message-ID: <51D63B9C.4060204@cn.fujitsu.com> Date: Fri, 05 Jul 2013 11:21:00 +0800 From: Gu Zheng User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20110930 Thunderbird/7.0.1 MIME-Version: 1.0 To: Benjamin LaHaise CC: Tang Chen , Mel Gorman , Minchan Kim , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com, riel@redhat.com, rientjes@google.com, isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com, jiang.liu@huawei.com, zab@redhat.com, jmoyer@redhat.com, linux-mm@kvack.org, linux-aio@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Marek Szyprowski Subject: Re: [WiP]: aio support for migrating pages (Re: [PATCH V2 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()) References: <20130517002349.GI1008@kvack.org> <5195A3F4.70803@cn.fujitsu.com> <20130517143718.GK1008@kvack.org> <519AD6F8.2070504@cn.fujitsu.com> <20130521022733.GT1008@kvack.org> <51B6F107.80501@cn.fujitsu.com> <20130611144525.GB14404@kvack.org> <51D12E7B.6080301@cn.fujitsu.com> <20130702180008.GQ16399@kvack.org> <51D51B66.3000301@cn.fujitsu.com> <20130704114153.GD11006@kvack.org> In-Reply-To: <20130704114153.GD11006@kvack.org> X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/07/05 11:22:43, Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at 2013/07/05 11:22:47, Serialize complete at 2013/07/05 11:22:47 
Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/04/2013 07:41 PM, Benjamin LaHaise wrote: > On Thu, Jul 04, 2013 at 02:51:18PM +0800, Gu Zheng wrote: >> Hi Ben, >> When I test your patch on kernel 3.10, the kernel panics when an aio job >> completes or exits, specifically in aio_free_ring(). The following is part of >> the dmesg output. > > What is your test case? Just the one you mentioned in the previous mail: http://www.kvack.org/~bcrl/aio/aio-numa-test.c Thanks, Gu > > -ben >
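A note on exercising this path: the traces in this thread end in aio_free_ring(), reached through kill_ioctx_work and free_ioctx, so the BUG fires while an AIO context is being torn down. The C sketch below is a hypothetical minimal exerciser, not the aio-numa-test.c program linked above, whose contents are not reproduced in this thread. It assumes a Linux system where the raw io_setup and io_destroy system calls are available; the function name aio_ring_roundtrip is made up for illustration. It only creates and destroys a context, so it may or may not be enough to trigger the reported panic on a patched kernel.

```c
/* Hypothetical sketch: NOT the aio-numa-test.c referenced in the thread. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/aio_abi.h>   /* aio_context_t */
#include <sys/syscall.h>     /* SYS_io_setup, SYS_io_destroy */
#include <unistd.h>          /* syscall() */

/*
 * Create an AIO context sized for nr_events completions, then destroy it.
 * Returns 0 on success or a negative errno value on failure.
 * (aio_ring_roundtrip is an illustrative name, not a kernel or libaio API.)
 */
int aio_ring_roundtrip(unsigned int nr_events)
{
    aio_context_t ctx = 0;   /* io_setup() requires a zeroed context id */

    /* The kernel allocates and maps the completion ring for this context. */
    if (syscall(SYS_io_setup, nr_events, &ctx) < 0)
        return -errno;

    /* Destroying the context releases the ring pages again. */
    if (syscall(SYS_io_destroy, ctx) < 0)
        return -errno;

    return 0;
}
```

Calling aio_ring_roundtrip(128) from a small main() and printing the return value is enough to drive one setup/teardown cycle; a negative result such as -ENOSYS or -EAGAIN means the context was never created, so the teardown path was not reached.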