* [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
@ 2013-02-04 10:04 Lin Feng
  2013-02-04 10:04 ` [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Lin Feng
  ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread

From: Lin Feng @ 2013-02-04 10:04 UTC (permalink / raw)
To: akpm, mgorman, bcrl, viro
Cc: khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes,
    isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio,
    linux-fsdevel, linux-kernel, Lin Feng

Currently get_user_pages() always tries to allocate pages from the movable
zone. As discussed in the thread https://lkml.org/lkml/2012/11/29/69, in some
cases it is easy for users of get_user_pages() to pin user pages for a long
time (so far we have found that pages pinned as aio ring pages are one such
case), which is fatal for the memory hotplug/remove framework.

So the 1st patch introduces a new library function called
get_user_pages_non_movable() to pin pages only from the non-movable zones in
memory. It is a wrapper of get_user_pages(), but it makes sure that all pages
come from non-movable zones via additional page migration.

The 2nd patch works around the bug that aio ring pages pinned by
get_user_pages() cannot be migrated, by using the new function. It only takes
effect when configured with CONFIG_MEMORY_HOTREMOVE; otherwise the old
get_user_pages() is used.

Lin Feng (2):
  mm: hotplug: implement non-movable version of get_user_pages()
  fs/aio.c: use non-movable version of get_user_pages() to pin ring
    pages when support memory hotremove

 fs/aio.c               |  6 +++++
 include/linux/mm.h     |  5 ++++
 include/linux/mmzone.h |  4 ++++
 mm/memory.c            | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_isolation.c    |  5 ++++
 5 files changed, 83 insertions(+)

-- 
1.7.11.7

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
* [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
  2013-02-04 10:04 [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Lin Feng
@ 2013-02-04 10:04 ` Lin Feng
  2013-02-05  0:06   ` Andrew Morton
  2013-02-04 10:04 ` [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Lin Feng
  2013-02-05  0:58 ` [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Minchan Kim
  2 siblings, 1 reply; 34+ messages in thread

From: Lin Feng @ 2013-02-04 10:04 UTC (permalink / raw)
To: akpm, mgorman, bcrl, viro
Cc: khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes,
    isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio,
    linux-fsdevel, linux-kernel, Lin Feng

get_user_pages() always tries to allocate pages from the movable zone, which
is not reliable for the memory hotremove framework in some cases.

This patch introduces a new library function called
get_user_pages_non_movable() to pin pages only from the non-movable zones in
memory. It is a wrapper of get_user_pages(), but it makes sure that all pages
come from non-movable zones via additional page migration.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
---
 include/linux/mm.h     |  5 ++++
 include/linux/mmzone.h |  4 ++++
 mm/memory.c            | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_isolation.c    |  5 ++++
 4 files changed, 77 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 66e2f7c..2a25d0e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas);
+#endif
 struct kvec;
 int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
 			struct page **pages);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 73b64a3..5db811e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
 	return (idx == ZONE_NORMAL);
 }
 
+static inline int is_movable(struct zone *zone)
+{
+	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
+}
 /**
  * is_highmem - helper function to quickly check if a struct zone is a
  *              highmem zone or not.
 *              This is an attempt to keep references
 *              to ZONE_{DMA,NORMAL,HIGHMEM,etc} in general code to a minimum.
 */

diff --git a/mm/memory.c b/mm/memory.c
index bb1369f..e3b8e19 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -58,6 +58,8 @@
 #include <linux/elf.h>
 #include <linux/gfp.h>
 #include <linux/migrate.h>
+#include <linux/page-isolation.h>
+#include <linux/mm_inline.h>
 #include <linux/string.h>
 
 #include <asm/io.h>
@@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/**
+ * It's a wrapper of get_user_pages() but it makes sure that all pages come from
+ * non-movable zone via additional page migration.
+ */
+int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int nr_pages, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	int ret, i, isolate_err, migrate_pre_flag;
+	LIST_HEAD(pagelist);
+
+retry:
+	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
+			vmas);
+
+	isolate_err = 0;
+	migrate_pre_flag = 0;
+
+	for (i = 0; i < ret; i++) {
+		if (is_movable(page_zone(pages[i]))) {
+			if (!migrate_pre_flag) {
+				if (migrate_prep())
+					goto put_page;
+				migrate_pre_flag = 1;
+			}
+
+			if (!isolate_lru_page(pages[i])) {
+				inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
+						page_is_file_cache(pages[i]));
+				list_add_tail(&pages[i]->lru, &pagelist);
+			} else {
+				isolate_err = 1;
+				goto put_page;
+			}
+		}
+	}
+
+	/* All pages are non movable, we are done :) */
+	if (i == ret && list_empty(&pagelist))
+		return ret;
+
+put_page:
+	/* Undo the effects of former get_user_pages(), we won't pin anything */
+	for (i = 0; i < ret; i++)
+		put_page(pages[i]);
+
+	if (migrate_pre_flag && !isolate_err) {
+		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
+					false, MIGRATE_SYNC, MR_SYSCALL);
+		/* Steal pages from non-movable zone successfully?
+		 */
+		if (!ret)
+			goto retry;
+	}
+
+	putback_lru_pages(&pagelist);
+	return 0;
+}
+EXPORT_SYMBOL(get_user_pages_non_movable);
+#endif
+
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 383bdbb..1b7bd17 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -247,6 +247,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
 	return ret ? 0 : -EBUSY;
 }
 
+/**
+ * @private: 0 means the page can be allocated from the movable zone, otherwise forbidden
+ */
 struct page *alloc_migrate_target(struct page *page, unsigned long private,
 				  int **resultp)
 {
@@ -254,6 +257,8 @@ struct page *alloc_migrate_target(struct page *page, unsigned long private,
 
 	if (PageHighMem(page))
 		gfp_mask |= __GFP_HIGHMEM;
+	if (unlikely(private != 0))
+		gfp_mask &= ~__GFP_MOVABLE;
 
 	return alloc_page(gfp_mask);
 }
-- 
1.7.11.7
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-04 10:04 ` [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Lin Feng @ 2013-02-05 0:06 ` Andrew Morton 2013-02-05 0:18 ` Andrew Morton ` (2 more replies) 0 siblings, 3 replies; 34+ messages in thread From: Andrew Morton @ 2013-02-05 0:06 UTC (permalink / raw) To: Lin Feng Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel melreadthis On Mon, 4 Feb 2013 18:04:07 +0800 Lin Feng <linfeng@cn.fujitsu.com> wrote: > get_user_pages() always tries to allocate pages from movable zone, which is not > reliable to memory hotremove framework in some case. > > This patch introduces a new library function called get_user_pages_non_movable() > to pin pages only from zone non-movable in memory. > It's a wrapper of get_user_pages() but it makes sure that all pages come from > non-movable zone via additional page migration. > > ... > > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > struct page **pages, struct vm_area_struct **vmas); > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > struct page **pages); > +#ifdef CONFIG_MEMORY_HOTREMOVE > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, int nr_pages, int write, int force, > + struct page **pages, struct vm_area_struct **vmas); > +#endif The ifdefs aren't really needed here and I encourage people to omit them. This keeps the header files looking neater and reduces the chances of things later breaking because we forgot to update some CONFIG_foo logic in a header file. 
The downside is that errors will be revealed at link time rather than at compile time, but that's a pretty small cost. > struct kvec; > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > struct page **pages); > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 73b64a3..5db811e 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > return (idx == ZONE_NORMAL); > } > > +static inline int is_movable(struct zone *zone) > +{ > + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > +} A better name would be zone_is_movable(). We haven't been very consistent about this in mmzone.h, but zone_is_foo() is pretty common. And a neater implementation would be return zone_idx(zone) == ZONE_MOVABLE; All of which made me look at mmzone.h, and what I saw wasn't very nice :( Please review... From: Andrew Morton <akpm@linux-foundation.org> Subject: include/linux/mmzone.h: cleanups - implement zone_idx() in C to fix its references-args-twice macro bug - use zone_idx() in is_highmem() to remove large amounts of silly fluff. Cc: Lin Feng <linfeng@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- include/linux/mmzone.h | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups +++ a/include/linux/mmzone.h @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by /* * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. 
*/ -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) +static inline enum zone_type zone_idx(struct zone *zone) +{ + return zone - zone->zone_pgdat->node_zones; +} static inline int populated_zone(struct zone *zone) { @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon static inline int is_highmem(struct zone *zone) { #ifdef CONFIG_HIGHMEM - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || - (zone_off == ZONE_MOVABLE * sizeof(*zone) && - zone_movable_is_highmem()); + enum zone_type idx = zone_idx(zone); + + return idx == ZONE_HIGHMEM || + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); #else return 0; #endif _ > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -58,6 +58,8 @@ > #include <linux/elf.h> > #include <linux/gfp.h> > #include <linux/migrate.h> > +#include <linux/page-isolation.h> > +#include <linux/mm_inline.h> > #include <linux/string.h> > > #include <asm/io.h> > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > } > EXPORT_SYMBOL(get_user_pages); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > +/** > + * It's a wrapper of get_user_pages() but it makes sure that all pages come from > + * non-movable zone via additional page migration. > + */ This needs a description of why the function exists - say something about the requirements of memory hotplug. Also a few words describing how the function works would be good. > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, int nr_pages, int write, int force, > + struct page **pages, struct vm_area_struct **vmas) > +{ > + int ret, i, isolate_err, migrate_pre_flag; > + LIST_HEAD(pagelist); > + > +retry: > + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, > + vmas); We should handle (ret < 0) here. At present the function will silently convert an error return into "return 0", which is bad. 
The function does appear to otherwise do the right thing if get_user_pages() failed, but only due to good luck. > + isolate_err = 0; > + migrate_pre_flag = 0; > + > + for (i = 0; i < ret; i++) { > + if (is_movable(page_zone(pages[i]))) { > + if (!migrate_pre_flag) { > + if (migrate_prep()) > + goto put_page; > + migrate_pre_flag = 1; > + } > + > + if (!isolate_lru_page(pages[i])) { > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > + page_is_file_cache(pages[i])); > + list_add_tail(&pages[i]->lru, &pagelist); > + } else { > + isolate_err = 1; > + goto put_page; > + } > + } > + } > + > + /* All pages are non movable, we are done :) */ > + if (i == ret && list_empty(&pagelist)) > + return ret; > + > +put_page: > + /* Undo the effects of former get_user_pages(), we won't pin anything */ > + for (i = 0; i < ret; i++) > + put_page(pages[i]); > + > + if (migrate_pre_flag && !isolate_err) { > + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > + false, MIGRATE_SYNC, MR_SYSCALL); > + /* Steal pages from non-movable zone successfully? */ > + if (!ret) > + goto retry; This is buggy. migrate_pages() doesn't empty its `from' argument, so page_list must be reinitialised here (or, better, at the start of the loop). Mel, what's up with migrate_pages()? Shouldn't it be removing the pages from the list when MIGRATEPAGE_SUCCESS? The use of list_for_each_entry_safe() suggests we used to do that... > + } > + > + putback_lru_pages(&pagelist); > + return 0; > +} > +EXPORT_SYMBOL(get_user_pages_non_movable); > +#endif > + Generally, I'd like Mel to go through (the next version of) this patch carefully, please. -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-05 0:06 ` Andrew Morton @ 2013-02-05 0:18 ` Andrew Morton 2013-02-05 3:09 ` Lin Feng 2013-02-05 11:57 ` Mel Gorman 2 siblings, 0 replies; 34+ messages in thread From: Andrew Morton @ 2013-02-05 0:18 UTC (permalink / raw) To: Lin Feng, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel Also... On Mon, 4 Feb 2013 16:06:24 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > > +put_page: > > + /* Undo the effects of former get_user_pages(), we won't pin anything */ > > + for (i = 0; i < ret; i++) > > + put_page(pages[i]); We can use release_pages() here. release_pages() is designed to be more efficient when we're putting the final reference to (most of) the pages. It probably has little if any benefit when putting still-in-use pages, as we're doing here. But please consider... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-05 0:06 ` Andrew Morton 2013-02-05 0:18 ` Andrew Morton @ 2013-02-05 3:09 ` Lin Feng 2013-02-05 21:13 ` Andrew Morton 2013-02-05 11:57 ` Mel Gorman 2 siblings, 1 reply; 34+ messages in thread From: Lin Feng @ 2013-02-05 3:09 UTC (permalink / raw) To: Andrew Morton Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel, Tang chen, Gu Zheng Hi Andrew, On 02/05/2013 08:06 AM, Andrew Morton wrote: > > melreadthis > > On Mon, 4 Feb 2013 18:04:07 +0800 > Lin Feng <linfeng@cn.fujitsu.com> wrote: > >> get_user_pages() always tries to allocate pages from movable zone, which is not >> reliable to memory hotremove framework in some case. >> >> This patch introduces a new library function called get_user_pages_non_movable() >> to pin pages only from zone non-movable in memory. >> It's a wrapper of get_user_pages() but it makes sure that all pages come from >> non-movable zone via additional page migration. >> >> ... >> >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, >> struct page **pages, struct vm_area_struct **vmas); >> int get_user_pages_fast(unsigned long start, int nr_pages, int write, >> struct page **pages); >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, >> + unsigned long start, int nr_pages, int write, int force, >> + struct page **pages, struct vm_area_struct **vmas); >> +#endif > > The ifdefs aren't really needed here and I encourage people to omit > them. This keeps the header files looking neater and reduces the > chances of things later breaking because we forgot to update some > CONFIG_foo logic in a header file. 
The downside is that errors will be > revealed at link time rather than at compile time, but that's a pretty > small cost. OK, got it, thanks :) > >> struct kvec; >> int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, >> struct page **pages); >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >> index 73b64a3..5db811e 100644 >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) >> return (idx == ZONE_NORMAL); >> } >> >> +static inline int is_movable(struct zone *zone) >> +{ >> + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; >> +} > > A better name would be zone_is_movable(). We haven't been very > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > Yes, zone_is_xxx() should be a better name, but there are some analogous definition like is_dma32() and is_normal() etc, if we only use zone_is_movable() for movable zone it will break such naming rules. Should we update other ones in a separate patch later or just keep the old style? > And a neater implementation would be > > return zone_idx(zone) == ZONE_MOVABLE; > OK. After your change, should we also update for other ones such as is_normal()..? > All of which made me look at mmzone.h, and what I saw wasn't very nice :( > > Please review... > > > From: Andrew Morton <akpm@linux-foundation.org> > Subject: include/linux/mmzone.h: cleanups > > - implement zone_idx() in C to fix its references-args-twice macro bug > > - use zone_idx() in is_highmem() to remove large amounts of silly fluff. 
> > Cc: Lin Feng <linfeng@cn.fujitsu.com> > Cc: Mel Gorman <mel@csn.ul.ie> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > include/linux/mmzone.h | 13 ++++++++----- > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h > --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups > +++ a/include/linux/mmzone.h > @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by > /* > * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. > */ > -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > +static inline enum zone_type zone_idx(struct zone *zone) > +{ > + return zone - zone->zone_pgdat->node_zones; > +} > > static inline int populated_zone(struct zone *zone) > { > @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon > static inline int is_highmem(struct zone *zone) > { > #ifdef CONFIG_HIGHMEM > - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || > - (zone_off == ZONE_MOVABLE * sizeof(*zone) && > - zone_movable_is_highmem()); > + enum zone_type idx = zone_idx(zone); > + > + return idx == ZONE_HIGHMEM || > + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); > #else > return 0; > #endif > _ > > >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -58,6 +58,8 @@ >> #include <linux/elf.h> >> #include <linux/gfp.h> >> #include <linux/migrate.h> >> +#include <linux/page-isolation.h> >> +#include <linux/mm_inline.h> >> #include <linux/string.h> >> >> #include <asm/io.h> >> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, >> } >> EXPORT_SYMBOL(get_user_pages); >> >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> +/** >> + * It's a wrapper of get_user_pages() but it makes sure that all pages come from >> + * non-movable zone via additional page migration. 
>> + */ > > This needs a description of why the function exists - say something > about the requirements of memory hotplug. > > Also a few words describing how the function works would be good. OK, I will add them in next version. > >> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, >> + unsigned long start, int nr_pages, int write, int force, >> + struct page **pages, struct vm_area_struct **vmas) >> +{ >> + int ret, i, isolate_err, migrate_pre_flag; >> + LIST_HEAD(pagelist); >> + >> +retry: >> + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, >> + vmas); > > We should handle (ret < 0) here. At present the function will silently > convert an error return into "return 0", which is bad. The function > does appear to otherwise do the right thing if get_user_pages() failed, > but only due to good luck. Sorry, I forgot this, thanks. > >> + isolate_err = 0; >> + migrate_pre_flag = 0; >> + >> + for (i = 0; i < ret; i++) { >> + if (is_movable(page_zone(pages[i]))) { >> + if (!migrate_pre_flag) { >> + if (migrate_prep()) >> + goto put_page; >> + migrate_pre_flag = 1; >> + } >> + >> + if (!isolate_lru_page(pages[i])) { >> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >> + page_is_file_cache(pages[i])); >> + list_add_tail(&pages[i]->lru, &pagelist); >> + } else { >> + isolate_err = 1; >> + goto put_page; >> + } >> + } >> + } >> + >> + /* All pages are non movable, we are done :) */ >> + if (i == ret && list_empty(&pagelist)) >> + return ret; >> + >> +put_page: >> + /* Undo the effects of former get_user_pages(), we won't pin anything */ >> + for (i = 0; i < ret; i++) >> + put_page(pages[i]); >> + >> + if (migrate_pre_flag && !isolate_err) { >> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >> + false, MIGRATE_SYNC, MR_SYSCALL); >> + /* Steal pages from non-movable zone successfully? */ >> + if (!ret) >> + goto retry; > > This is buggy. 
migrate_pages() doesn't empty its `from' argument, so
> page_list must be reinitialised here (or, better, at the start of the loop).

migrate_pages()
  list_for_each_entry_safe()
    unmap_and_move()
      if (rc != -EAGAIN) {
              list_del(&page->lru);
      }

IIUC, a page that is migrated successfully will be removed from the `from`
list here. So if all pages have been migrated successfully, the list will be
empty and we goto retry. Otherwise the rest of the unmigrated pages will be
handled later by putback_lru_pages(&pagelist), and the function just returns.

Also there are many places using such logic:

1274                 nr_failed = migrate_pages(&pagelist, new_vma_page,
1275                                         (unsigned long)vma,
1276                                         false, MIGRATE_SYNC,
1277                                         MR_MEMPOLICY_MBIND);
1278                 if (nr_failed)
1279                         putback_lru_pages(&pagelist);

Since we traverse the list and handle each page separately, it's likely that
we have migrated some pages successfully before we encounter a failure. If the
pagelist were still carrying the successfully migrated pages when
migrate_pages() returned, such code would be buggy.

>
> Mel, what's up with migrate_pages()?  Shouldn't it be removing the
> pages from the list when MIGRATEPAGE_SUCCESS?  The use of
> list_for_each_entry_safe() suggests we used to do that...
>
>> +	}
>> +
>> +	putback_lru_pages(&pagelist);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL(get_user_pages_non_movable);
>> +#endif
>> +
>
> Generally, I'd like Mel to go through (the next version of) this patch
> carefully, please.

On 02/05/2013 08:18 AM, Andrew Morton wrote:
> On Mon, 4 Feb 2013 16:06:24 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
>
>>> +put_page:
>>> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
>>> +	for (i = 0; i < ret; i++)
>>> +		put_page(pages[i]);
>
> We can use release_pages() here.
>
> release_pages() is designed to be more efficient when we're putting the
> final reference to (most of) the pages.
It probably has little if any > benefit when putting still-in-use pages, as we're doing here. > > But please consider... OK, I will try :) > thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-05 3:09 ` Lin Feng @ 2013-02-05 21:13 ` Andrew Morton 0 siblings, 0 replies; 34+ messages in thread From: Andrew Morton @ 2013-02-05 21:13 UTC (permalink / raw) To: Lin Feng Cc: mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel, Tang chen, Gu Zheng On Tue, 05 Feb 2013 11:09:27 +0800 Lin Feng <linfeng@cn.fujitsu.com> wrote: > > > >> struct kvec; > >> int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > >> struct page **pages); > >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > >> index 73b64a3..5db811e 100644 > >> --- a/include/linux/mmzone.h > >> +++ b/include/linux/mmzone.h > >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > >> return (idx == ZONE_NORMAL); > >> } > >> > >> +static inline int is_movable(struct zone *zone) > >> +{ > >> + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > >> +} > > > > A better name would be zone_is_movable(). We haven't been very > > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > > > Yes, zone_is_xxx() should be a better name, but there are some analogous > definition like is_dma32() and is_normal() etc, if we only use zone_is_movable() > for movable zone it will break such naming rules. > Should we update other ones in a separate patch later or just keep the old style? I do think the old names were poorly chosen. Yes, we could fix them up sometime but it's hardly a pressing issue. > > And a neater implementation would be > > > > return zone_idx(zone) == ZONE_MOVABLE; > > > OK. After your change, should we also update for other ones such as is_normal()..? Sure. 
From: Andrew Morton <akpm@linux-foundation.org> Subject: include-linux-mmzoneh-cleanups-fix use zone_idx() some more, further simplify is_highmem() Cc: Lin Feng <linfeng@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- include/linux/mmzone.h | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix include/linux/mmzone.h --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix +++ a/include/linux/mmzone.h @@ -859,25 +859,18 @@ static inline int is_normal_idx(enum zon */ static inline int is_highmem(struct zone *zone) { -#ifdef CONFIG_HIGHMEM - enum zone_type idx = zone_idx(zone); - - return idx == ZONE_HIGHMEM || - (idx == ZONE_MOVABLE && zone_movable_is_highmem()); -#else - return 0; -#endif + return is_highmem_idx(zone_idx(zone)); } static inline int is_normal(struct zone *zone) { - return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL; + return zone_idx(zone) == ZONE_NORMAL; } static inline int is_dma32(struct zone *zone) { #ifdef CONFIG_ZONE_DMA32 - return zone == zone->zone_pgdat->node_zones + ZONE_DMA32; + return zone_idx(zone) == ZONE_DMA32; #else return 0; #endif @@ -886,7 +879,7 @@ static inline int is_dma32(struct zone * static inline int is_dma(struct zone *zone) { #ifdef CONFIG_ZONE_DMA - return zone == zone->zone_pgdat->node_zones + ZONE_DMA; + return zone_idx(zone) == ZONE_DMA; #else return 0; #endif _ > >> + /* All pages are non movable, we are done :) */ > >> + if (i == ret && list_empty(&pagelist)) > >> + return ret; > >> + > >> +put_page: > >> + /* Undo the effects of former get_user_pages(), we won't pin anything */ > >> + for (i = 0; i < ret; i++) > >> + put_page(pages[i]); > >> + > >> + if (migrate_pre_flag && !isolate_err) { > >> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >> + false, MIGRATE_SYNC, MR_SYSCALL); > >> + /* Steal pages from non-movable zone successfully? 
*/ > >> + if (!ret) > >> + goto retry; > > > > This is buggy. migrate_pages() doesn't empty its `from' argument, so > > page_list must be reinitialised here (or, better, at the start of the loop). > migrate_pages() > list_for_each_entry_safe() > unmap_and_move() > if (rc != -EAGAIN) { > list_del(&page->lru); > } ah, OK, there it is. -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-05 0:06 ` Andrew Morton 2013-02-05 0:18 ` Andrew Morton 2013-02-05 3:09 ` Lin Feng @ 2013-02-05 11:57 ` Mel Gorman 2013-02-05 13:32 ` Mel Gorman ` (2 more replies) 2 siblings, 3 replies; 34+ messages in thread From: Mel Gorman @ 2013-02-05 11:57 UTC (permalink / raw) To: Andrew Morton Cc: Lin Feng, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote: > On Mon, 4 Feb 2013 18:04:07 +0800 > Lin Feng <linfeng@cn.fujitsu.com> wrote: > > > get_user_pages() always tries to allocate pages from movable zone, which is not > > reliable to memory hotremove framework in some case. > > > > This patch introduces a new library function called get_user_pages_non_movable() > > to pin pages only from zone non-movable in memory. > > It's a wrapper of get_user_pages() but it makes sure that all pages come from > > non-movable zone via additional page migration. > > > > ... > > > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > > struct page **pages, struct vm_area_struct **vmas); > > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > > struct page **pages); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > > + unsigned long start, int nr_pages, int write, int force, > > + struct page **pages, struct vm_area_struct **vmas); > > +#endif > > The ifdefs aren't really needed here and I encourage people to omit > them. This keeps the header files looking neater and reduces the > chances of things later breaking because we forgot to update some > CONFIG_foo logic in a header file. 
The downside is that errors will be > revealed at link time rather than at compile time, but that's a pretty > small cost. > As an aside, if ifdefs *have* to be used then it often better to have a pattern like #ifdef CONFIG_MEMORY_HOTREMOVE int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int nr_pages, int write, int force, struct page **pages, struct vm_area_struct **vmas); #else static inline get_user_pages_non_movable(...) { get_user_pages(...) } #endif It eliminates the need for #ifdefs in the C file that calls get_user_pages_non_movable(). > > struct kvec; > > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > > struct page **pages); > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 73b64a3..5db811e 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > > return (idx == ZONE_NORMAL); > > } > > > > +static inline int is_movable(struct zone *zone) > > +{ > > + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > > +} > > A better name would be zone_is_movable(). He is matching the names of the is_normal(), is_dma32() and is_dma() helpers. Unfortunately I would expect a name like is_movable() to return true if the page had either been allocated with __GFP_MOVABLE or if it was checking migrate can currently handle the page. is_movable() is indeed a terrible name. They should be renamed or deleted in preparation -- see below. > We haven't been very > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > > And a neater implementation would be > > return zone_idx(zone) == ZONE_MOVABLE; > Yes. > All of which made me look at mmzone.h, and what I saw wasn't very nice :( > > Please review... 
> > > From: Andrew Morton <akpm@linux-foundation.org> > Subject: include/linux/mmzone.h: cleanups > > - implement zone_idx() in C to fix its references-args-twice macro bug > > - use zone_idx() in is_highmem() to remove large amounts of silly fluff. > > Cc: Lin Feng <linfeng@cn.fujitsu.com> > Cc: Mel Gorman <mel@csn.ul.ie> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > include/linux/mmzone.h | 13 ++++++++----- > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h > --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups > +++ a/include/linux/mmzone.h > @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by > /* > * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. > */ > -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > +static inline enum zone_type zone_idx(struct zone *zone) > +{ > + return zone - zone->zone_pgdat->node_zones; > +} > > static inline int populated_zone(struct zone *zone) > { > @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon > static inline int is_highmem(struct zone *zone) > { > #ifdef CONFIG_HIGHMEM > - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || > - (zone_off == ZONE_MOVABLE * sizeof(*zone) && > - zone_movable_is_highmem()); > + enum zone_type idx = zone_idx(zone); > + > + return idx == ZONE_HIGHMEM || > + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); > #else > return 0; > #endif *blinks*. Ok, while I think your version looks nicer, it is reverting ddc81ed2 (remove sparse warning for mmzone.h). According to my version of gcc at least, your patch reintroduces the sar and sparse complains if run as make ARCH=i386 C=2 CF=-Wsparse-all .... this? 
From: Mel Gorman <mgorman@suse.de> Subject: [PATCH] include/linux/mmzone.h: cleanups - implement zone_idx() in C to fix its references-args-twice macro bug - implement zone_is_idx in an attempt to explain what the fluff in is_highmem is for - rename is_highmem to zone_is_highmem as preparation for zone_is_movable() - implement zone_is_movable as a helper for places that care about ZONE_MOVABLE - Remove is_normal, is_dma32 and is_dma because apparently no one cares - convert users of zone_idx(zone) == ZONE_FOO to appropriate zone_is_foo() helper - Use is_highmem_idx() instead of is_highmem for the PageHighMem implementation. Should be functionally equivalent because we only care about the zones index, not exactly what zone it is [akpm@linux-foundation.org: zone_idx() reimplementation] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> --- arch/s390/mm/init.c | 2 +- arch/tile/mm/init.c | 5 +---- arch/x86/mm/highmem_32.c | 2 +- include/linux/mmzone.h | 49 +++++++++++++++----------------------------- include/linux/page-flags.h | 2 +- kernel/power/snapshot.c | 12 +++++------ mm/page_alloc.c | 6 +++--- 7 files changed, 30 insertions(+), 48 deletions(-) diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c index 81e596c..c26c321 100644 --- a/arch/s390/mm/init.c +++ b/arch/s390/mm/init.c @@ -231,7 +231,7 @@ int arch_add_memory(int nid, u64 start, u64 size) if (rc) return rc; for_each_zone(zone) { - if (zone_idx(zone) != ZONE_MOVABLE) { + if (!zone_is_movable(zone)) { /* Add range within existing zone limits */ zone_start_pfn = zone->zone_start_pfn; zone_end_pfn = zone->zone_start_pfn + diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c index ef29d6c..8287749 100644 --- a/arch/tile/mm/init.c +++ b/arch/tile/mm/init.c @@ -733,9 +733,6 @@ static void __init set_non_bootmem_pages_init(void) for_each_zone(z) { unsigned long start, end; int nid = z->zone_pgdat->node_id; -#ifdef CONFIG_HIGHMEM - int idx = zone_idx(z); -#endif 
start = z->zone_start_pfn; end = start + z->spanned_pages; @@ -743,7 +740,7 @@ static void __init set_non_bootmem_pages_init(void) start = max(start, max_low_pfn); #ifdef CONFIG_HIGHMEM - if (idx == ZONE_HIGHMEM) + if (zone_is_highmem(z)) totalhigh_pages += z->spanned_pages; #endif if (kdata_huge) { diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c index 6f31ee5..63020e7 100644 --- a/arch/x86/mm/highmem_32.c +++ b/arch/x86/mm/highmem_32.c @@ -124,7 +124,7 @@ void __init set_highmem_pages_init(void) for_each_zone(zone) { unsigned long zone_start_pfn, zone_end_pfn; - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) continue; zone_start_pfn = zone->zone_start_pfn; diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a23923b..5ecef75 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -779,10 +779,20 @@ static inline int local_memory_node(int node_id) { return node_id; }; unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long); #endif +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) +{ + /* This mess avoids a potentially expensive pointer subtraction. */ + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; + return zone_off == idx * sizeof(*zone); +} + /* * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. */ -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) +static inline enum zone_type zone_idx(struct zone *zone) +{ + return zone - zone->zone_pgdat->node_zones; +} static inline int populated_zone(struct zone *zone) { @@ -810,50 +820,25 @@ static inline int is_highmem_idx(enum zone_type idx) #endif } -static inline int is_normal_idx(enum zone_type idx) -{ - return (idx == ZONE_NORMAL); -} - /** - * is_highmem - helper function to quickly check if a struct zone is a + * zone_is_highmem - helper function to quickly check if a struct zone is a * highmem zone or not. 
This is an attempt to keep references * to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum. * @zone - pointer to struct zone variable */ -static inline int is_highmem(struct zone *zone) +static inline bool zone_is_highmem(struct zone *zone) { #ifdef CONFIG_HIGHMEM - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || - (zone_off == ZONE_MOVABLE * sizeof(*zone) && - zone_movable_is_highmem()); -#else - return 0; -#endif -} - -static inline int is_normal(struct zone *zone) -{ - return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL; -} - -static inline int is_dma32(struct zone *zone) -{ -#ifdef CONFIG_ZONE_DMA32 - return zone == zone->zone_pgdat->node_zones + ZONE_DMA32; + return zone_is_idx(zone, ZONE_HIGHMEM) || + (zone_is_idx(zone, ZONE_MOVABLE) && zone_movable_is_highmem()); #else return 0; #endif } -static inline int is_dma(struct zone *zone) +static inline bool zone_is_movable(struct zone *zone) { -#ifdef CONFIG_ZONE_DMA - return zone == zone->zone_pgdat->node_zones + ZONE_DMA; -#else - return 0; -#endif + return zone_is_idx(zone, ZONE_MOVABLE); } /* These two functions are used to setup the per zone pages min values */ diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index b5d1384..2f541f1 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -237,7 +237,7 @@ PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */ * Must use a macro here due to header dependency issues. page_zone() is not * available at this point. 
*/ -#define PageHighMem(__p) is_highmem(page_zone(__p)) +#define PageHighMem(__p) is_highmem_idx(page_zonenum(__p)) #else PAGEFLAG_FALSE(HighMem) #endif diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index 0de2857..cbf0da9 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -830,7 +830,7 @@ static unsigned int count_free_highmem_pages(void) unsigned int cnt = 0; for_each_populated_zone(zone) - if (is_highmem(zone)) + if (zone_is_highmem(zone)) cnt += zone_page_state(zone, NR_FREE_PAGES); return cnt; @@ -879,7 +879,7 @@ static unsigned int count_highmem_pages(void) for_each_populated_zone(zone) { unsigned long pfn, max_zone_pfn; - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) continue; mark_free_pages(zone); @@ -943,7 +943,7 @@ static unsigned int count_data_pages(void) unsigned int n = 0; for_each_populated_zone(zone) { - if (is_highmem(zone)) + if (zone_is_highmem(zone)) continue; mark_free_pages(zone); @@ -989,7 +989,7 @@ static void safe_copy_page(void *dst, struct page *s_page) static inline struct page * page_is_saveable(struct zone *zone, unsigned long pfn) { - return is_highmem(zone) ? + return zone_is_highmem(zone) ? 
saveable_highmem_page(zone, pfn) : saveable_page(zone, pfn); } @@ -1338,7 +1338,7 @@ int hibernate_preallocate_memory(void) size = 0; for_each_populated_zone(zone) { size += snapshot_additional_pages(zone); - if (is_highmem(zone)) + if (zone_is_highmem(zone)) highmem += zone_page_state(zone, NR_FREE_PAGES); else count += zone_page_state(zone, NR_FREE_PAGES); @@ -1481,7 +1481,7 @@ static int enough_free_mem(unsigned int nr_pages, unsigned int nr_highmem) unsigned int free = alloc_normal; for_each_populated_zone(zone) - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) free += zone_page_state(zone, NR_FREE_PAGES); nr_pages += count_pages_for_highmem(nr_highmem); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7e208f0..9558341 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5136,7 +5136,7 @@ static void __setup_per_zone_wmarks(void) /* Calculate total number of !ZONE_HIGHMEM pages */ for_each_zone(zone) { - if (!is_highmem(zone)) + if (!zone_is_highmem(zone)) lowmem_pages += zone->present_pages; } @@ -5146,7 +5146,7 @@ static void __setup_per_zone_wmarks(void) spin_lock_irqsave(&zone->lock, flags); tmp = (u64)pages_min * zone->present_pages; do_div(tmp, lowmem_pages); - if (is_highmem(zone)) { + if (zone_is_highmem(zone)) { /* * __GFP_HIGH and PF_MEMALLOC allocations usually don't * need highmem pages, so cap pages_min to a small @@ -5585,7 +5585,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count) * For avoiding noise data, lru_add_drain_all() should be called * If ZONE_MOVABLE, the zone never contains unmovable pages */ - if (zone_idx(zone) == ZONE_MOVABLE) + if (zone_is_movable(zone)) return false; mt = get_pageblock_migratetype(page); if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt)) > _ > > > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -58,6 +58,8 @@ > > #include <linux/elf.h> > > #include <linux/gfp.h> > > #include <linux/migrate.h> > > +#include <linux/page-isolation.h> > > +#include <linux/mm_inline.h> > > #include 
<linux/string.h> > > > > #include <asm/io.h> > > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, > > } > > EXPORT_SYMBOL(get_user_pages); > > > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > +/** > > + * It's a wrapper of get_user_pages() but it makes sure that all pages come from > > + * non-movable zone via additional page migration. > > + */ > > This needs a description of why the function exists - say something > about the requirements of memory hotplug. > > Also a few words describing how the function works would be good. > > > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > > + unsigned long start, int nr_pages, int write, int force, > > + struct page **pages, struct vm_area_struct **vmas) > > +{ > > + int ret, i, isolate_err, migrate_pre_flag; migrate_pre_flag should be bool and the name is a bit unusual. migrate_prepped? > > + LIST_HEAD(pagelist); > > + > > +retry: > > + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, > > + vmas); > > We should handle (ret < 0) here. At present the function will silently > convert an error return into "return 0", which is bad. The function > does appear to otherwise do the right thing if get_user_pages() failed, > but only due to good luck. > A BUG_ON check if we retry more than once wouldn't hurt either. It requires a broken implementation of alloc_migrate_target() but someone might "fix" it and miss this. A minor aside, it would be nice if we exited at this point if there was no populated ZONE_MOVABLE in the system. There is a movable_zone global variable already. If it was forced to be MAX_NR_ZONES before the call to find_usable_zone_for_movable() in memory initialisation we should be able to make a cheap check for it here. > > + isolate_err = 0; > > + migrate_pre_flag = 0; > > + > > + for (i = 0; i < ret; i++) { > > + if (is_movable(page_zone(pages[i]))) { This is a relatively expensive lookup. 
If my patch above is used then you can add a is_movable_idx helper and then this becomes if (is_movable_idx(page_zonenum(pages[i]))) which is cheaper. > > + if (!migrate_pre_flag) { > > + if (migrate_prep()) > > + goto put_page; CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never return failure. You could make this BUG_ON(migrate_prep()), remove this goto and the migrate_pre_flag check below becomes redundant. > > + migrate_pre_flag = 1; > > + } > > + > > + if (!isolate_lru_page(pages[i])) { > > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > > + page_is_file_cache(pages[i])); > > + list_add_tail(&pages[i]->lru, &pagelist); > > + } else { > > + isolate_err = 1; > > + goto put_page; > > + } isolate_lru_page() takes the LRU lock every time. If get_user_pages_non_movable is used heavily then you may encounter lock contention problems. Batching this lock would be a separate patch and should not be necessary yet but you should at least comment on it as a reminder. I think that a goto could also have been avoided here if you used break. The i == ret check below would be false and it would just fall through. Overall the flow would be a bit easier to follow. Why list_add_tail()? I don't think it's wrong but it's unusual to see list_add_tail() when list_add() is enough. > > + } > > + } > > + > > + /* All pages are non movable, we are done :) */ > > + if (i == ret && list_empty(&pagelist)) > > + return ret; > > + > > +put_page: > > + /* Undo the effects of former get_user_pages(), we won't pin anything */ > > + for (i = 0; i < ret; i++) > > + put_page(pages[i]); > > + release_pages. That comment is insufficient. There are non-obvious consequences to this logic. We are dropping pins on all pages regardless of what zone they are in. If the subsequent migration fails then we end up returning 0 with no pages pinned. The user-visible effect is that io_setup() fails for non-obvious reasons. 
It will return EAGAIN to userspace which will be interpreted as "The specified nr_events exceeds the user's limit of available events.". The application will either fail or potentially infinite loop if the developer interpreted EAGAIN as "try again" as opposed to "this is a permanent failure". Is that deliberate? Is it really preferable that AIO can fail to setup and the application exit just in case we want to hot-remove memory later? Should a failed migration generate a WARN_ON at least? I would think that it's better to WARN_ON_ONCE if migration fails but pin the pages as requested. If a future memory hot-remove operation fails then the warning will indicate why but applications will not fail as a result. It's a little clumsy but the memory hot-remove failure message could list what applications have pinned the pages that cannot be removed so the administrator has the option of force-killing the application. It is possible to discover what application is pinning a page from userspace but it would involve an expensive search with /proc/kpagemap > > + if (migrate_pre_flag && !isolate_err) { > > + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > > + false, MIGRATE_SYNC, MR_SYSCALL); The conversion of alloc_migrate_target is a bit problematic. It strips the __GFP_MOVABLE flag and the consequence of this is that it converts those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large number of pages, particularly if the number of get_user_pages_non_movable() increases for short-lived pins like direct IO. One way around this is to add a high_zoneidx parameter to __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask) as the high_zoneidx and create a new migration allocation function that passes on ZONE_HIGHMEM as high_zoneidx. 
That would force the allocation to happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE > > + /* Steal pages from non-movable zone successfully? */ > > + if (!ret) > > + goto retry; > > This is buggy. migrate_pages() doesn't empty its `from' argument, so > page_list must be reinitialised here (or, better, at the start of the loop). > page_list should be empty if ret == 0. > Mel, what's up with migrate_pages()? Shouldn't it be removing the > pages from the list when MIGRATEPAGE_SUCCESS? The use of > list_for_each_entry_safe() suggests we used to do that... > On successful migration, the pages are put back on the LRU at the end of unmap_and_move(). On migration failure the caller is responsible for putting the pages back on the LRU with putback_lru_pages() which happens below. > > + } > > + > > + putback_lru_pages(&pagelist); > > + return 0; > > +} > > +EXPORT_SYMBOL(get_user_pages_non_movable); > > +#endif > > + > -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-05 11:57 ` Mel Gorman @ 2013-02-05 13:32 ` Mel Gorman 2013-02-19 13:37 ` Lin Feng 2013-02-20 9:58 ` Simon Jeons 2013-02-06 2:26 ` Michel Lespinasse 2013-02-18 10:34 ` Lin Feng 2 siblings, 2 replies; 34+ messages in thread From: Mel Gorman @ 2013-02-05 13:32 UTC (permalink / raw) To: Andrew Morton Cc: Lin Feng, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel, linux-kernel On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote: > > > > + migrate_pre_flag = 1; > > > + } > > > + > > > + if (!isolate_lru_page(pages[i])) { > > > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > > > + page_is_file_cache(pages[i])); > > > + list_add_tail(&pages[i]->lru, &pagelist); > > > + } else { > > > + isolate_err = 1; > > > + goto put_page; > > > + } > > isolate_lru_page() takes the LRU lock every time. Credit to Michal Hocko for bringing this up but with the number of other issues I missed that this is also broken with respect to huge page handling. hugetlbfs pages will not be on the LRU so the isolation will mess up and the migration has to be handled differently. Ordinarily hugetlbfs pages cannot be allocated from ZONE_MOVABLE but it is possible to configure it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this encounters a hugetlbfs page, it'll just blow up. The other is that this almost certainly broken for transhuge page handling. gup returns the head and tail pages and ordinarily this is ok because the caller only cares about the physical address. Migration will also split a hugepage if it receives it but you are potentially adding tail pages to a list here and then migrating them. The split of the first page will get very confused. I'm not exactly sure what the result will be but it won't be pretty. 
Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled during testing? -- Mel Gorman SUSE Labs
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-05 13:32 ` Mel Gorman @ 2013-02-19 13:37 ` Lin Feng 2013-02-20 2:34 ` Lin Feng 2013-02-20 9:58 ` Simon Jeons 1 sibling, 1 reply; 34+ messages in thread From: Lin Feng @ 2013-02-19 13:37 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel, linux-kernel Hi Mel, On 02/05/2013 09:32 PM, Mel Gorman wrote: > On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote: >> >>>> + migrate_pre_flag = 1; >>>> + } >>>> + >>>> + if (!isolate_lru_page(pages[i])) { >>>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >>>> + page_is_file_cache(pages[i])); >>>> + list_add_tail(&pages[i]->lru, &pagelist); >>>> + } else { >>>> + isolate_err = 1; >>>> + goto put_page; >>>> + } >> >> isolate_lru_page() takes the LRU lock every time. > > Credit to Michal Hocko for bringing this up but with the number of > other issues I missed that this is also broken with respect to huge page > handling. hugetlbfs pages will not be on the LRU so the isolation will mess > up and the migration has to be handled differently. Ordinarily hugetlbfs > pages cannot be allocated from ZONE_MOVABLE but it is possible to configure > it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this > encounters a hugetlbfs page, it'll just blow up. Looking into the migrate_huge_page() code, I find that to support non-movable migration of hugetlbfs pages we would have to invent another allocation interface such as alloc_huge_page_node_nonmovable(). The cost of that is large (it bloats the code and has a big impact on the current alloc_huge_page_node()) but gains little, and I think pinning a hugepage is a corner case. So can we skip over hugepages without migrating them but emit some WARN_ON() info? Is that acceptable?
> > The other is that this almost certainly broken for transhuge page > handling. gup returns the head and tail pages and ordinarily this is ok I can't find the code that does such things :( could you please point it out to me? > because the caller only cares about the physical address. Migration will > also split a hugepage if it receives it but you are potentially adding > tail pages to a list here and then migrating them. The split of the first > page will get very confused. I'm not exactly sure what the result will be > but it won't be pretty. > > Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled > during testing? I checked my config file and both CONFIG options above are enabled. However, it was only tested with two services invoking io_setup(), and that works fine. thanks, linfeng
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-19 13:37 ` Lin Feng @ 2013-02-20 2:34 ` Lin Feng 2013-02-20 2:44 ` Wanpeng Li ` (2 more replies) 0 siblings, 3 replies; 34+ messages in thread From: Lin Feng @ 2013-02-20 2:34 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel, linux-kernel On 02/19/2013 09:37 PM, Lin Feng wrote: >> > >> > The other is that this almost certainly broken for transhuge page >> > handling. gup returns the head and tail pages and ordinarily this is ok > I can't find codes doing such things :(, could you please point me out? > Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it. flee... thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-20 2:34 ` Lin Feng @ 2013-02-20 2:44 ` Wanpeng Li 2013-02-20 2:44 ` Wanpeng Li [not found] ` <20130220024435.GA30208@hacker.(null)> 2 siblings, 0 replies; 34+ messages in thread From: Wanpeng Li @ 2013-02-20 2:44 UTC (permalink / raw) To: Lin Feng Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel, linux-kernel On Wed, Feb 20, 2013 at 10:34:36AM +0800, Lin Feng wrote: > > >On 02/19/2013 09:37 PM, Lin Feng wrote: >>> > >>> > The other is that this almost certainly broken for transhuge page >>> > handling. gup returns the head and tail pages and ordinarily this is ok >> I can't find codes doing such things :(, could you please point me out? >> >Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it. >flee... According to the compound page, the first page of compound page is called head page, other sub pages are called tail pages. Regards, Wanpeng Li > >thanks, >linfeng > >-- >To unsubscribe, send a message with 'unsubscribe linux-mm' in >the body to majordomo@kvack.org. For more info on Linux MM, >see: http://www.linux-mm.org/ . >Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 34+ messages in thread
[parent not found: <20130220024435.GA30208@hacker.(null)>]
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() [not found] ` <20130220024435.GA30208@hacker.(null)> @ 2013-02-20 2:59 ` Lin Feng 0 siblings, 0 replies; 34+ messages in thread From: Lin Feng @ 2013-02-20 2:59 UTC (permalink / raw) To: Wanpeng Li Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel, linux-kernel Hi Wanpeng, On 02/20/2013 10:44 AM, Wanpeng Li wrote: >> Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it. >> >flee... > According to the compound page, the first page of compound page is > called head page, other sub pages are called tail pages. > > Regards, > Wanpeng Li > Yes, you are right, thanks for explaining. I thought it meant only the last one of the compound pages. thanks, linfeng
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-05 13:32 ` Mel Gorman 2013-02-19 13:37 ` Lin Feng @ 2013-02-20 9:58 ` Simon Jeons 2013-02-20 10:23 ` Lin Feng 1 sibling, 1 reply; 34+ messages in thread From: Simon Jeons @ 2013-02-20 9:58 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Lin Feng, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel, linux-kernel On 02/05/2013 09:32 PM, Mel Gorman wrote: > On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote: >>>> + migrate_pre_flag = 1; >>>> + } >>>> + >>>> + if (!isolate_lru_page(pages[i])) { >>>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >>>> + page_is_file_cache(pages[i])); >>>> + list_add_tail(&pages[i]->lru, &pagelist); >>>> + } else { >>>> + isolate_err = 1; >>>> + goto put_page; >>>> + } >> isolate_lru_page() takes the LRU lock every time. > Credit to Michal Hocko for bringing this up but with the number of > other issues I missed that this is also broken with respect to huge page > handling. hugetlbfs pages will not be on the LRU so the isolation will mess > up and the migration has to be handled differently. Ordinarily hugetlbfs > pages cannot be allocated from ZONE_MOVABLE but it is possible to configure > it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this > encounters a hugetlbfs page, it'll just blow up. As you said, hugetlbfs pages are not on the LRU list, so how can the isolation encounter a hugetlbfs page and blow up? > > The other is that this almost certainly broken for transhuge page > handling. gup returns the head and tail pages and ordinarily this is ok When do we need to gup THP pages? In the kvm case? > because the caller only cares about the physical address. Migration will > also split a hugepage if it receives it but you are potentially adding > tail pages to a list here and then migrating them.
> The split of the first page will get very confused. I'm not exactly sure what the result will be > but it won't be pretty. > > Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled > during testing? > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-20 9:58 ` Simon Jeons @ 2013-02-20 10:23 ` Lin Feng 2013-02-20 11:31 ` Simon Jeons 0 siblings, 1 reply; 34+ messages in thread From: Lin Feng @ 2013-02-20 10:23 UTC (permalink / raw) To: Simon Jeons Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel, linux-kernel Hi Simon, On 02/20/2013 05:58 PM, Simon Jeons wrote: > >> >> The other is that this is almost certainly broken for transhuge page >> handling. gup returns the head and tail pages and ordinarily this is ok > > When does gup need to handle THP? In the KVM case? gup just pins the wanted pages (4KB each on x86) of the user address space in memory. We can't predict whether the pages allocated for a user address space are THP or normal pages, so we have to handle both, and I think it has nothing to do with KVM. thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-20 10:23 ` Lin Feng @ 2013-02-20 11:31 ` Simon Jeons 2013-02-20 11:54 ` Lin Feng 0 siblings, 1 reply; 34+ messages in thread From: Simon Jeons @ 2013-02-20 11:31 UTC (permalink / raw) To: Lin Feng Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel, linux-kernel On 02/20/2013 06:23 PM, Lin Feng wrote: > Hi Simon, > > On 02/20/2013 05:58 PM, Simon Jeons wrote: >>> The other is that this is almost certainly broken for transhuge page >>> handling. gup returns the head and tail pages and ordinarily this is ok >> When does gup need to handle THP? In the KVM case? > gup just pins the wanted pages (4KB each on x86) of the user address space in memory. > We can't predict whether the pages allocated for a user address space are THP or > normal pages, so we have to handle both, and I think it has nothing to do with KVM. OK, I'm curious: apart from make_pages_present, which functions (that call gup) does a userspace process go through to pin pages? > > thanks, > linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-20 11:31 ` Simon Jeons @ 2013-02-20 11:54 ` Lin Feng 0 siblings, 0 replies; 34+ messages in thread From: Lin Feng @ 2013-02-20 11:54 UTC (permalink / raw) To: Simon Jeons Cc: Mel Gorman, Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, mhocko, linux-mm, linux-aio, linux-fsdevel, linux-kernel On 02/20/2013 07:31 PM, Simon Jeons wrote: > On 02/20/2013 06:23 PM, Lin Feng wrote: >> Hi Simon, >> >> On 02/20/2013 05:58 PM, Simon Jeons wrote: >>>> The other is that this is almost certainly broken for transhuge page >>>> handling. gup returns the head and tail pages and ordinarily this is ok >>> When does gup need to handle THP? In the KVM case? >> gup just pins the wanted pages (4KB each on x86) of the user address space in memory. >> We can't predict whether the pages allocated for a user address space are THP or >> normal pages, so we have to handle both, and I think it has nothing to do with KVM. > > OK, I'm curious: apart from make_pages_present, which functions (that call gup) does a userspace process go through to pin pages? No, a userspace process doesn't pin any pages directly; it pins them indirectly through certain syscalls such as io_setup(), because the kernel can't take a page fault there and has to keep the pages alive. The kernel wants to communicate with userspace, for example to notify it of some events, so it needs some sort of buffer that both kernel and user space can access, and that is what leads to the so-called pinned pages created by gup. thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-05 11:57 ` Mel Gorman 2013-02-05 13:32 ` Mel Gorman @ 2013-02-06 2:26 ` Michel Lespinasse 2013-02-06 10:41 ` Mel Gorman 2013-02-18 10:34 ` Lin Feng 2 siblings, 1 reply; 34+ messages in thread From: Michel Lespinasse @ 2013-02-06 2:26 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Lin Feng, bcrl, viro, khlebnikov, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel Just nitpicking, but: On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman <mgorman@suse.de> wrote: > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) > +{ > + /* This mess avoids a potentially expensive pointer subtraction. */ > + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > + return zone_off == idx * sizeof(*zone); > +} Maybe: return zone == zone->zone_pgdat->node_zones + idx; ? -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-06 2:26 ` Michel Lespinasse @ 2013-02-06 10:41 ` Mel Gorman 0 siblings, 0 replies; 34+ messages in thread From: Mel Gorman @ 2013-02-06 10:41 UTC (permalink / raw) To: Michel Lespinasse Cc: Andrew Morton, Lin Feng, bcrl, viro, khlebnikov, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel On Tue, Feb 05, 2013 at 06:26:51PM -0800, Michel Lespinasse wrote: > Just nitpicking, but: > > On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman <mgorman@suse.de> wrote: > > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) > > +{ > > + /* This mess avoids a potentially expensive pointer subtraction. */ > > + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > > + return zone_off == idx * sizeof(*zone); > > +} > > Maybe: > return zone == zone->zone_pgdat->node_zones + idx; > ? > Not a nit at all. Yours is more readable but it generates more code. A single line function that uses the helper generates 0x3f bytes of code (mostly function entry/exit) with your version and 0x39 bytes with mine. The difference in efficiency is marginal as your version uses lea to multiply by a constant but it's still slightly heavier. The old code is fractionally better, your version is more readable so it's up to Andrew really. Right now I think he's gone with his own version with zone_idx() in the name of readability whatever sparse has to say about the matter. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-05 11:57 ` Mel Gorman 2013-02-05 13:32 ` Mel Gorman 2013-02-06 2:26 ` Michel Lespinasse @ 2013-02-18 10:34 ` Lin Feng 2013-02-18 15:17 ` Mel Gorman 2 siblings, 1 reply; 34+ messages in thread From: Lin Feng @ 2013-02-18 10:34 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel Hi Mel, See below. On 02/05/2013 07:57 PM, Mel Gorman wrote: > On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote: >> The ifdefs aren't really needed here and I encourage people to omit >> them. This keeps the header files looking neater and reduces the >> chances of things later breaking because we forgot to update some >> CONFIG_foo logic in a header file. The downside is that errors will be >> revealed at link time rather than at compile time, but that's a pretty >> small cost. >> > > As an aside, if ifdefs *have* to be used then it often better to have a > pattern like > > #ifdef CONFIG_MEMORY_HOTREMOVE > int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > unsigned long start, int nr_pages, int write, int force, > struct page **pages, struct vm_area_struct **vmas); > #else > static inline get_user_pages_non_movable(...) > { > get_user_pages(...) > } > #endif > > It eliminates the need for #ifdefs in the C file that calls > get_user_pages_non_movable(). Thanks for pointing out :) >>> + >>> +retry: >>> + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, >>> + vmas); >> >> We should handle (ret < 0) here. At present the function will silently >> convert an error return into "return 0", which is bad. The function >> does appear to otherwise do the right thing if get_user_pages() failed, >> but only due to good luck. 
>> > A BUG_ON check if we retry more than once wouldn't hurt either. It requires > a broken implementation of alloc_migrate_target() but someone might "fix" > it and miss this. Agree. > > A minor aside, it would be nice if we exited at this point if there was > no populated ZONE_MOVABLE in the system. There is a movable_zone global > variable already. If it was forced to be MAX_NR_ZONES before the call to > find_usable_zone_for_movable() in memory initialisation we should be able > to make a cheap check for it here. OK. >>> + if (!migrate_pre_flag) { >>> + if (migrate_prep()) >>> + goto put_page; > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never > return failure. You could make this BUG_ON(migrate_prep()), remove this > goto and the migrate_pre_flag check below becomes redundant. Sorry, I don't quite follow. Without migrate_pre_flag, the BUG_ON() check will call/check migrate_prep() every time we isolate a single page; do we have to do so? I mean that for a group of pages it's sufficient to call migrate_prep() only once, and some other migration-related code uses this scheme. So I'm confused: am I missing something, or is the performance cost small? :( > >>> + migrate_pre_flag = 1; >>> + } >>> + >>> + if (!isolate_lru_page(pages[i])) { >>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >>> + page_is_file_cache(pages[i])); >>> + list_add_tail(&pages[i]->lru, &pagelist); >>> + } else { >>> + isolate_err = 1; >>> + goto put_page; >>> + } > > isolate_lru_page() takes the LRU lock every time. If > get_user_pages_non_movable is used heavily then you may encounter lock > contention problems. Batching this lock would be a separate patch and should > not be necessary yet but you should at least comment on it as a reminder. Farsighted; I should improve that in the future. > > I think that a goto could also have been avoided here if you used break. The > i == ret check below would be false and it would just fall through.
> Overall the flow would be a bit easier to follow. Yes, I noticed this before. I thought using goto could save a few micro-ops (here only the i == ret check) at the cost of some readability, while still being clear. Also, there are some other places in the current kernel facing such a performance/readability trade-off, but it seems that the developers prefer readability; why? I would have thought performance was critical for the kernel. > > Why list_add_tail()? I don't think it's wrong but it's unusual to see > list_add_tail() when list_add() is enough. Sorry, not intentional; I just copied it from other code without thinking further ;-) > >>> + } >>> + } >>> + >>> + /* All pages are non movable, we are done :) */ >>> + if (i == ret && list_empty(&pagelist)) >>> + return ret; >>> + >>> +put_page: >>> + /* Undo the effects of former get_user_pages(), we won't pin anything */ >>> + for (i = 0; i < ret; i++) >>> + put_page(pages[i]); >>> + > > release_pages. > > That comment is insufficient. There are non-obvious consequences to this > logic. We are dropping pins on all pages regardless of what zone they > are in. If the subsequent migration fails then we end up returning 0 > with no pages pinned. The user-visible effect is that io_setup() fails > for non-obvious reasons. It will return EAGAIN to userspace which will be > interpreted as "The specified nr_events exceeds the user's limit of available > events.". The application will either fail or potentially infinite loop > if the developer interpreted EAGAIN as "try again" as opposed to "this is > a permanent failure". Yes, I missed that. > > Is that deliberate? Is it really preferable that AIO can fail to setup > and the application exit just in case we want to hot-remove memory later? > Should a failed migration generate a WARN_ON at least? > > I would think that it's better to WARN_ON_ONCE if migration fails but pin > the pages as requested.
> If a future memory hot-remove operation fails > then the warning will indicate why but applications will not fail as a > result. I agree completely; I will keep the pages pinned and issue a WARN_ON_ONCE message in that case. > It's a little clumsy but the memory hot-remove failure message > could list what applications have pinned the pages that cannot be removed > so the administrator has the option of force-killing the application. It > is possible to discover what application is pinning a page from userspace > but it would involve an expensive search with /proc/kpagemap > >>> + if (migrate_pre_flag && !isolate_err) { >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >>> + false, MIGRATE_SYNC, MR_SYSCALL); > > The conversion of alloc_migrate_target is a bit problematic. It strips > the __GFP_MOVABLE flag and the consequence of this is that it converts > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > number of pages, particularly if the number of get_user_pages_non_movable() > increases for short-lived pins like direct IO. Sorry, I don't quite understand here either. If we use the new migration allocation function you describe below, won't an increasing number of get_user_pages_non_movable() calls also lead to large numbers of MIGRATE_UNMOVABLE pages? What's the difference; am I missing something? > > One way around this is to add a high_zoneidx parameter to > __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new > inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask) > as the high_zoneidx and create a new migration allocation function that > passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to > happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE On 02/05/2013 09:32 PM, Mel Gorman wrote: > Credit to Michal Hocko for bringing this up but with the number of > other issues I missed that this is also broken with respect to huge page > handling.
> hugetlbfs pages will not be on the LRU so the isolation will mess > up and the migration has to be handled differently. Ordinarily hugetlbfs > pages cannot be allocated from ZONE_MOVABLE but it is possible to configure > it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this > encounters a hugetlbfs page, it'll just blow up. > > The other is that this is almost certainly broken for transhuge page > handling. gup returns the head and tail pages and ordinarily this is ok > because the caller only cares about the physical address. Migration will > also split a hugepage if it receives it but you are potentially adding > tail pages to a list here and then migrating them. The split of the first > page will get very confused. I'm not exactly sure what the result will be > but it won't be pretty. > > Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled > during testing? Sorry, I hadn't considered THP and hugetlb pages yet; I will handle those cases based on your comments later. Thanks a lot for your review :) linfeng -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-18 10:34 ` Lin Feng @ 2013-02-18 15:17 ` Mel Gorman 2013-02-19 9:55 ` Lin Feng 0 siblings, 1 reply; 34+ messages in thread From: Mel Gorman @ 2013-02-18 15:17 UTC (permalink / raw) To: Lin Feng Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel On Mon, Feb 18, 2013 at 06:34:44PM +0800, Lin Feng wrote: > >>> + if (!migrate_pre_flag) { > >>> + if (migrate_prep()) > >>> + goto put_page; > > > > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never > > return failure. You could make this BUG_ON(migrate_prep()), remove this > > goto and the migrate_pre_flag check below becomes redundant. > Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check > will call/check migrate_prep() every time we isolate a single page, do we have > to do so? I was referring to the migrate_pre_flag check further down and I'm not suggesting it be removed from here as you do want to call migrate_prep() only once. However, as it'll never return failure for any kernel configuration that allows memory hot-remove, this goto can be removed and the flow simplified. > > > >>> + migrate_pre_flag = 1; > >>> + } > >>> + > >>> + if (!isolate_lru_page(pages[i])) { > >>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > >>> + page_is_file_cache(pages[i])); > >>> + list_add_tail(&pages[i]->lru, &pagelist); > >>> + } else { > >>> + isolate_err = 1; > >>> + goto put_page; > >>> + } > > > > isolate_lru_page() takes the LRU lock every time. If > > get_user_pages_non_movable is used heavily then you may encounter lock > > contention problems. Batching this lock would be a separate patch and should > > not be necessary yet but you should at least comment on it as a reminder. 
> Farsighted, should improve it in the future. > > > > > I think that a goto could also have been avoided here if you used break. The > > i == ret check below would be false and it would just fall through. > > Overall the flow would be a bit easier to follow. > Yes, I noticed this before. I thought using goto could save some micro ops > (here is only the i == ret check) though lower the readability but still be clearly. > > Also there are some other place in current kernel facing such performance/readability > race condition, but it seems that the developers prefer readability, why? While I > think performance is fatal to kernel.. > Memory hot-remove and page migration are not performance critical paths. For page migration, the cost will be likely dominated by the page copy but it's also possible the cost will be dominated by locking. The difference between a break and goto will not even be measurable. > > <SNIP> > > > > result. It's a little clumsy but the memory hot-remove failure message > > could list what applications have pinned the pages that cannot be removed > > so the administrator has the option of force-killing the application. It > > is possible to discover what application is pinning a page from userspace > > but it would involve an expensive search with /proc/kpagemap > > > >>> + if (migrate_pre_flag && !isolate_err) { > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >>> + false, MIGRATE_SYNC, MR_SYSCALL); > > > > The conversion of alloc_migrate_target is a bit problematic. It strips > > the __GFP_MOVABLE flag and the consequence of this is that it converts > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > > number of pages, particularly if the number of get_user_pages_non_movable() > > increases for short-lived pins like direct IO. > > Sorry, I don't quite understand here neither. 
If we use the following new > migration allocation function as you said, the increasing number of > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE > pages. What's the difference, do I miss something? > The replacement function preserves the __GFP_MOVABLE flag. It cannot use ZONE_MOVABLE but otherwise the newly allocated page will be grouped with other movable pages. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-18 15:17 ` Mel Gorman @ 2013-02-19 9:55 ` Lin Feng 2013-02-19 10:34 ` Mel Gorman 0 siblings, 1 reply; 34+ messages in thread From: Lin Feng @ 2013-02-19 9:55 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel Hi Mel, On 02/18/2013 11:17 PM, Mel Gorman wrote: >>> > > <SNIP> >>> > > >>> > > result. It's a little clumsy but the memory hot-remove failure message >>> > > could list what applications have pinned the pages that cannot be removed >>> > > so the administrator has the option of force-killing the application. It >>> > > is possible to discover what application is pinning a page from userspace >>> > > but it would involve an expensive search with /proc/kpagemap >>> > > >>>>> > >>> + if (migrate_pre_flag && !isolate_err) { >>>>> > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, >>>>> > >>> + false, MIGRATE_SYNC, MR_SYSCALL); >>> > > >>> > > The conversion of alloc_migrate_target is a bit problematic. It strips >>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts >>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large >>> > > number of pages, particularly if the number of get_user_pages_non_movable() >>> > > increases for short-lived pins like direct IO. >> > >> > Sorry, I don't quite understand here neither. If we use the following new >> > migration allocation function as you said, the increasing number of >> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE >> > pages. What's the difference, do I miss something? >> > > The replacement function preserves the __GFP_MOVABLE flag. 
> It cannot use > ZONE_MOVABLE but otherwise the newly allocated page will be grouped with > other movable pages. Ah, got it: "__GFP_MOVABLE is not only a zone specifier but also an allocation policy." Could I clear the __GFP_HIGHMEM flag in alloc_migrate_target, depending on its private parameter, so that we keep the movable allocation policy while still allocating pages from non-movable zones, with little change? Does that approach work? Otherwise I will follow your suggestion. thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() 2013-02-19 9:55 ` Lin Feng @ 2013-02-19 10:34 ` Mel Gorman 0 siblings, 0 replies; 34+ messages in thread From: Mel Gorman @ 2013-02-19 10:34 UTC (permalink / raw) To: Lin Feng Cc: Andrew Morton, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel On Tue, Feb 19, 2013 at 05:55:30PM +0800, Lin Feng wrote: > Hi Mel, > > On 02/18/2013 11:17 PM, Mel Gorman wrote: > >>> > > <SNIP> > >>> > > > >>> > > result. It's a little clumsy but the memory hot-remove failure message > >>> > > could list what applications have pinned the pages that cannot be removed > >>> > > so the administrator has the option of force-killing the application. It > >>> > > is possible to discover what application is pinning a page from userspace > >>> > > but it would involve an expensive search with /proc/kpagemap > >>> > > > >>>>> > >>> + if (migrate_pre_flag && !isolate_err) { > >>>>> > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >>>>> > >>> + false, MIGRATE_SYNC, MR_SYSCALL); > >>> > > > >>> > > The conversion of alloc_migrate_target is a bit problematic. It strips > >>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts > >>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large > >>> > > number of pages, particularly if the number of get_user_pages_non_movable() > >>> > > increases for short-lived pins like direct IO. > >> > > >> > Sorry, I don't quite understand here neither. If we use the following new > >> > migration allocation function as you said, the increasing number of > >> > get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE > >> > pages. What's the difference, do I miss something? > >> > > > The replacement function preserves the __GFP_MOVABLE flag. 
It cannot use > > ZONE_MOVABLE but otherwise the newly allocated page will be grouped with > > other movable pages. > > Ah, got it " But GFP_MOVABLE is not only a zone specifier but also an allocation policy.". > Update the comment and describe the exception then. > Could I clear __GFP_HIGHMEM flag in alloc_migrate_target depending on private parameter so > that we can keep MIGRATE_UNMOVABLE policy also allocate page none movable zones with little > change? > It should work (double check gfp_zone) but then the allocation cannot use the highmem zone. If you can be 100% certain that zone will not exist be populated then it will work as expected but it's a hack and should be commented clearly. You could do a BUILD_BUG_ON if CONFIG_HIGHMEM is set to enforce it. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove 2013-02-04 10:04 [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Lin Feng 2013-02-04 10:04 ` [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Lin Feng @ 2013-02-04 10:04 ` Lin Feng 2013-02-04 15:18 ` Jeff Moyer 2013-02-05 0:58 ` [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Minchan Kim 2 siblings, 1 reply; 34+ messages in thread From: Lin Feng @ 2013-02-04 10:04 UTC (permalink / raw) To: akpm, mgorman, bcrl, viro Cc: khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel, Lin Feng This patch works around the bug that aio ring pages pinned by get_user_pages() can't be migrated, by using the new function. It only takes effect when configured with CONFIG_MEMORY_HOTREMOVE; otherwise it falls back to the old version of get_user_pages().
Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com> --- fs/aio.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/aio.c b/fs/aio.c index 71f613c..0e9b30a 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) } dprintk("mmap address: 0x%08lx\n", info->mmap_base); +#ifdef CONFIG_MEMORY_HOTREMOVE + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, + info->mmap_base, nr_pages, + 1, 0, info->ring_pages, NULL); +#else info->nr_pages = get_user_pages(current, ctx->mm, info->mmap_base, nr_pages, 1, 0, info->ring_pages, NULL); +#endif up_write(&ctx->mm->mmap_sem); if (unlikely(info->nr_pages != nr_pages)) { -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove 2013-02-04 10:04 ` [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Lin Feng @ 2013-02-04 15:18 ` Jeff Moyer 2013-02-04 23:02 ` Zach Brown 2013-02-05 5:06 ` Lin Feng 0 siblings, 2 replies; 34+ messages in thread From: Jeff Moyer @ 2013-02-04 15:18 UTC (permalink / raw) To: Lin Feng Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel Lin Feng <linfeng@cn.fujitsu.com> writes: > This patch gets around the aio ring pages can't be migrated bug caused by > get_user_pages() via using the new function. It only works as configed with > CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > > Cc: Benjamin LaHaise <bcrl@kvack.org> > Cc: Alexander Viro <viro@zeniv.linux.org.uk> > Cc: Andrew Morton <akpm@linux-foundation.org> > Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com> > Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com> > Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com> > --- > fs/aio.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/fs/aio.c b/fs/aio.c > index 71f613c..0e9b30a 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) > } > > dprintk("mmap address: 0x%08lx\n", info->mmap_base); > +#ifdef CONFIG_MEMORY_HOTREMOVE > + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, > + info->mmap_base, nr_pages, > + 1, 0, info->ring_pages, NULL); > +#else > info->nr_pages = get_user_pages(current, ctx->mm, > info->mmap_base, nr_pages, > 1, 0, info->ring_pages, NULL); > +#endif Can't you hide this in your 1/1 patch, by providing this function as just a static inline wrapper around get_user_pages when CONFIG_MEMORY_HOTREMOVE is not enabled? 
Cheers,
Jeff

^ permalink raw reply [flat|nested] 34+ messages in thread
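For reference, the static inline wrapper Jeff suggests could look like the sketch below. This is a compilable approximation, not the actual kernel patch: the stub body here stands in for the real get_user_pages() in mm/memory.c, and the eight-argument signature follows the 3.8-era GUP API.

```c
#include <stddef.h>

struct task_struct;
struct mm_struct;
struct vm_area_struct;
struct page;

/* Stub standing in for the real kernel get_user_pages(); in the kernel
 * this is the existing mm/memory.c implementation. The stub "pins"
 * every requested page by simply reporting full success. */
static long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
			   unsigned long start, unsigned long nr_pages,
			   int write, int force, struct page **pages,
			   struct vm_area_struct **vmas)
{
	(void)tsk; (void)mm; (void)start; (void)write; (void)force;
	(void)pages; (void)vmas;
	return (long)nr_pages;
}

#ifdef CONFIG_MEMORY_HOTREMOVE
/* Real declaration: the migrating variant implemented in patch 1/2. */
long get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
				unsigned long start, unsigned long nr_pages,
				int write, int force, struct page **pages,
				struct vm_area_struct **vmas);
#else
/* With hot-remove disabled, the non-movable variant collapses to a
 * plain wrapper, exactly as Jeff proposes. */
static inline long
get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
			   unsigned long start, unsigned long nr_pages,
			   int write, int force, struct page **pages,
			   struct vm_area_struct **vmas)
{
	return get_user_pages(tsk, mm, start, nr_pages, write, force,
			      pages, vmas);
}
#endif
```

With this arrangement, aio_setup_ring() can call get_user_pages_non_movable() unconditionally and the #ifdef disappears from the caller.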
* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove 2013-02-04 15:18 ` Jeff Moyer @ 2013-02-04 23:02 ` Zach Brown 2013-02-05 5:35 ` Lin Feng 2013-02-05 5:06 ` Lin Feng 1 sibling, 1 reply; 34+ messages in thread From: Zach Brown @ 2013-02-04 23:02 UTC (permalink / raw) To: Jeff Moyer Cc: Lin Feng, akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel > > index 71f613c..0e9b30a 100644 > > --- a/fs/aio.c > > +++ b/fs/aio.c > > @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) > > } > > > > dprintk("mmap address: 0x%08lx\n", info->mmap_base); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, > > + info->mmap_base, nr_pages, > > + 1, 0, info->ring_pages, NULL); > > +#else > > info->nr_pages = get_user_pages(current, ctx->mm, > > info->mmap_base, nr_pages, > > 1, 0, info->ring_pages, NULL); > > +#endif > > Can't you hide this in your 1/1 patch, by providing this function as > just a static inline wrapper around get_user_pages when > CONFIG_MEMORY_HOTREMOVE is not enabled? Yes, please. Having callers duplicate the call site for a single optional boolean input is unacceptable. But do we want another input argument as a name? Should aio have been using get_user_pages_fast()? (and so now _fast_non_movable?) I wonder if it's time to offer the booleans as a _flags() variant, much like the current internal flags for __get_user_pages(). The write and force arguments are already booleans, we have a different fast api, and now we're adding non-movable. The NON_MOVABLE flag would be 0 without MEMORY_HOTREMOVE, easy peasy. Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might also be nice :). No? - z -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
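Zach's _flags() idea might be sketched as below. All of the GUP_FLAGS_* names are hypothetical; the thread only proposes the direction (mainline later went a similar way with caller-visible FOLL_* gup_flags).

```c
/* Hedged sketch: fold the existing write/force booleans and the
 * proposed non-movable request into one flags word. The non-movable
 * bit compiles to 0 when memory hot-remove support is off, so the
 * request becomes a no-op without any #ifdef at call sites. */

#define GUP_FLAGS_WRITE		0x1u
#define GUP_FLAGS_FORCE		0x2u
#ifdef CONFIG_MEMORY_HOTREMOVE
#define GUP_FLAGS_NON_MOVABLE	0x4u
#else
#define GUP_FLAGS_NON_MOVABLE	0x0u	/* "easy peasy", per Zach */
#endif

/* A caller like aio_setup_ring() would then pass a readable mask
 * instead of the mysterious '1, 0' boolean pair: */
static unsigned int aio_ring_gup_flags(void)
{
	return GUP_FLAGS_WRITE | GUP_FLAGS_NON_MOVABLE;
}
```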
* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove 2013-02-04 23:02 ` Zach Brown @ 2013-02-05 5:35 ` Lin Feng 0 siblings, 0 replies; 34+ messages in thread From: Lin Feng @ 2013-02-05 5:35 UTC (permalink / raw) To: Zach Brown Cc: Jeff Moyer, akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel, Tang chen, Gu Zheng Hi Zach, On 02/05/2013 07:02 AM, Zach Brown wrote: >>> index 71f613c..0e9b30a 100644 >>> --- a/fs/aio.c >>> +++ b/fs/aio.c >>> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) >>> } >>> >>> dprintk("mmap address: 0x%08lx\n", info->mmap_base); >>> +#ifdef CONFIG_MEMORY_HOTREMOVE >>> + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, >>> + info->mmap_base, nr_pages, >>> + 1, 0, info->ring_pages, NULL); >>> +#else >>> info->nr_pages = get_user_pages(current, ctx->mm, >>> info->mmap_base, nr_pages, >>> 1, 0, info->ring_pages, NULL); >>> +#endif >> >> Can't you hide this in your 1/1 patch, by providing this function as >> just a static inline wrapper around get_user_pages when >> CONFIG_MEMORY_HOTREMOVE is not enabled? > > Yes, please. Having callers duplicate the call site for a single > optional boolean input is unacceptable. I will deal with it in next version :) > > But do we want another input argument as a name? Should aio have been > using get_user_pages_fast()? (and so now _fast_non_movable?) > > I wonder if it's time to offer the booleans as a _flags() variant, much > like the current internal flags for __get_user_pages(). The write and > force arguments are already booleans, we have a different fast api, and > now we're adding non-movable. The NON_MOVABLE flag would be 0 without > MEMORY_HOTREMOVE, easy peasy. 
As my next reply mentions, IIUC an additional flag doesn't work in the GUP case; to summarize: while debugging get_user_pages() I found that some pages may already be present, allocated before we ever call get_user_pages(). __get_user_pages() has the following logic to handle that case:

1786	while (!(page = follow_page(vma, start, foll_flags))) {
1787		int ret;

For such pages an additional allocation flag has no effect, so it's difficult to make GUP as smart as we want. That's why I worked out the migration approach: it gets around the problem without messing up the current code. And even worse, we already have *8* arguments... Maybe we do have to rework the boolean arguments into bit flags; that seems no small amount of work :(

> Turning current callers' mysterious '1, 1' in to 'WRITE|FORCE' might
> also be nice :).

Agree, maybe we could handle that later :)

thanks,
linfeng

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove 2013-02-04 15:18 ` Jeff Moyer 2013-02-04 23:02 ` Zach Brown @ 2013-02-05 5:06 ` Lin Feng 1 sibling, 0 replies; 34+ messages in thread From: Lin Feng @ 2013-02-05 5:06 UTC (permalink / raw) To: Jeff Moyer Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, minchan, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel, Tang chen, Gu Zheng Hi Jeff, On 02/04/2013 11:18 PM, Jeff Moyer wrote: >> --- >> fs/aio.c | 6 ++++++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/fs/aio.c b/fs/aio.c >> index 71f613c..0e9b30a 100644 >> --- a/fs/aio.c >> +++ b/fs/aio.c >> @@ -138,9 +138,15 @@ static int aio_setup_ring(struct kioctx *ctx) >> } >> >> dprintk("mmap address: 0x%08lx\n", info->mmap_base); >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> + info->nr_pages = get_user_pages_non_movable(current, ctx->mm, >> + info->mmap_base, nr_pages, >> + 1, 0, info->ring_pages, NULL); >> +#else >> info->nr_pages = get_user_pages(current, ctx->mm, >> info->mmap_base, nr_pages, >> 1, 0, info->ring_pages, NULL); >> +#endif > > Can't you hide this in your 1/1 patch, by providing this function as > just a static inline wrapper around get_user_pages when > CONFIG_MEMORY_HOTREMOVE is not enabled? Good idea, it makes the callers more neatly :) thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages 2013-02-04 10:04 [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Lin Feng 2013-02-04 10:04 ` [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Lin Feng 2013-02-04 10:04 ` [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Lin Feng @ 2013-02-05 0:58 ` Minchan Kim 2013-02-05 4:42 ` Lin Feng 2 siblings, 1 reply; 34+ messages in thread From: Minchan Kim @ 2013-02-05 0:58 UTC (permalink / raw) To: Lin Feng Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel Hello, On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > Currently get_user_pages() always tries to allocate pages from movable zone, > as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > of get_user_pages() is easy to pin user pages for a long time(for now we found > that pages pinned as aio ring pages is such case), which is fatal for memory > hotplug/remove framework. > > So the 1st patch introduces a new library function called > get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > It's a wrapper of get_user_pages() but it makes sure that all pages come from > non-movable zone via additional page migration. > > The 2nd patch gets around the aio ring pages can't be migrated bug caused by > get_user_pages() via using the new function. It only works when configed with > CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). CMA has same issue but the problem is the driver developers or any subsystem using GUP can't know their pages is in CMA area or not in advance. 
So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? Even some driver module in embedded side doesn't open their source code. I would like to make GUP smart so it allocates a page from non-movable/non-cma area when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :( > > Lin Feng (2): > mm: hotplug: implement non-movable version of get_user_pages() > fs/aio.c: use non-movable version of get_user_pages() to pin ring > pages when support memory hotremove > > fs/aio.c | 6 +++++ > include/linux/mm.h | 5 ++++ > include/linux/mmzone.h | 4 ++++ > mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++ > mm/page_isolation.c | 5 ++++ > 5 files changed, 83 insertions(+) > > -- > 1.7.11.7 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages 2013-02-05 0:58 ` [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Minchan Kim @ 2013-02-05 4:42 ` Lin Feng 2013-02-05 5:25 ` Minchan Kim 0 siblings, 1 reply; 34+ messages in thread From: Lin Feng @ 2013-02-05 4:42 UTC (permalink / raw) To: Minchan Kim Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel Hi Minchan, On 02/05/2013 08:58 AM, Minchan Kim wrote: > Hello, > > On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: >> Currently get_user_pages() always tries to allocate pages from movable zone, >> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users >> of get_user_pages() is easy to pin user pages for a long time(for now we found >> that pages pinned as aio ring pages is such case), which is fatal for memory >> hotplug/remove framework. >> >> So the 1st patch introduces a new library function called >> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. >> It's a wrapper of get_user_pages() but it makes sure that all pages come from >> non-movable zone via additional page migration. >> >> The 2nd patch gets around the aio ring pages can't be migrated bug caused by >> get_user_pages() via using the new function. It only works when configed with >> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > > CMA has same issue but the problem is the driver developers or any subsystem > using GUP can't know their pages is in CMA area or not in advance. > So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? > Even some driver module in embedded side doesn't open their source code. Yes, it somehow depends on the users of GUP. 
In the MEMORY_HOTPLUG case, most users of GUP release the pinned pages immediately, and such users should keep good performance; for them, using the old-style interface is the smart way. We had better convert only the cases we must, using the new interface.

> I would like to make GUP smart so it allocates a page from non-movable/non-cma area
> when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP
> performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :(

While debugging get_user_pages() I found that some pages may already be present, allocated before we ever call get_user_pages(). __get_user_pages() has the following logic to handle that case:

1786	while (!(page = follow_page(vma, start, foll_flags))) {
1787		int ret;

For such pages an additional allocation flag has no effect; it's difficult to keep GUP as smart as we want :( So I worked out the migration approach to get around this and avoid messing up the current code :)

thanks,
linfeng

^ permalink raw reply [flat|nested] 34+ messages in thread
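The migration approach Lin describes can be modeled in miniature as follows. This is a toy user-space sketch, not kernel code: struct page, the zone values, and migrate_to_non_movable() are simplified stand-ins for what patch 1/2 does with real zone checks and migrate_pages() after the normal pin succeeds.

```c
#include <stddef.h>

/* Toy model of the pass inside get_user_pages_non_movable(): pin pages
 * normally first, then relocate any page that landed in ZONE_MOVABLE
 * into a non-movable zone before returning them to the caller. */

enum zone_type { ZONE_NORMAL, ZONE_MOVABLE };

struct page {
	enum zone_type zone;
};

/* Stand-in for migrate_pages(): "move" the page to ZONE_NORMAL. */
static void migrate_to_non_movable(struct page *page)
{
	page->zone = ZONE_NORMAL;
}

/* After a successful pin, sweep the page array and migrate the movable
 * ones; returns how many pages needed migration. */
static int ensure_non_movable(struct page **pages, int nr_pinned)
{
	int i, migrated = 0;

	for (i = 0; i < nr_pinned; i++) {
		if (pages[i]->zone == ZONE_MOVABLE) {
			migrate_to_non_movable(pages[i]);
			migrated++;
		}
	}
	return migrated;
}
```

This also illustrates why no allocation flag can help: pages pinned here may have been allocated long before the call, so the only option left at GUP time is to move them.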
* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages 2013-02-05 4:42 ` Lin Feng @ 2013-02-05 5:25 ` Minchan Kim 2013-02-05 6:18 ` Lin Feng 0 siblings, 1 reply; 34+ messages in thread From: Minchan Kim @ 2013-02-05 5:25 UTC (permalink / raw) To: Lin Feng Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel Hi Lin, On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote: > Hi Minchan, > > On 02/05/2013 08:58 AM, Minchan Kim wrote: > > Hello, > > > > On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > >> Currently get_user_pages() always tries to allocate pages from movable zone, > >> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > >> of get_user_pages() is easy to pin user pages for a long time(for now we found > >> that pages pinned as aio ring pages is such case), which is fatal for memory > >> hotplug/remove framework. > >> > >> So the 1st patch introduces a new library function called > >> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > >> It's a wrapper of get_user_pages() but it makes sure that all pages come from > >> non-movable zone via additional page migration. > >> > >> The 2nd patch gets around the aio ring pages can't be migrated bug caused by > >> get_user_pages() via using the new function. It only works when configed with > >> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > > > > CMA has same issue but the problem is the driver developers or any subsystem > > using GUP can't know their pages is in CMA area or not in advance. > > So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? > > Even some driver module in embedded side doesn't open their source code. > Yes, it somehow depends on the users of GUP. 
> In MEMORY_HOTPLUG case, as for most users
> of GUP, they will release the pinned pages immediately and to such users they should get
> a good performance, using the old style interface is a smart way. And we had better just
> deal with the cases we have to by using the new interface.

Hmm, I don't think you can be sure that most users under MEMORY_HOTPLUG will release pinned pages immediately. MEMORY_HOTPLUG could be used on embedded systems to reduce power via PASR, and some drivers there could use GUP anytime and anywhere. They can't know in advance whether they will hold the pinned pages for a long time or release them quickly, because that depends on events such as a user's response, which is not predetermined.

To solve it, we could add some WARN_ON in the CMA/MEMORY_HOTPLUG code in case migration fails due to the page count, then investigate whether a GUP user is REALLY the culprit, and if so yell at them: "Please use GUP_NM instead"? Yes, it could be done, but it would be a rather troublesome job.

> > I would like to make GUP smart so it allocates a page from non-movable/non-cma area
> > when memory-hotplug/cma is enabled(CONFIG_MIGRATE_ISOLATE). Yeb. it might hurt GUP
> > performance but it is just trade-off for using CMA/memory-hotplug, IMHO. :(
> As I debuged the get_user_pages(), I found that some pages is already there and may be
> allocated before we call get_user_pages(). __get_user_pages() have following logic to
> handle such case.
> 1786 while (!(page = follow_page(vma, start, foll_flags))) {
> 1787 	int ret;
> To such case an additional alloc-flag or such doesn't work, it's difficult to keep GUP
> as smart as we want :( , so I worked out the migration approach to get around and
> avoid messing up the current code :)

I haven't looked at your code in detail yet, but what's wrong with the following? Rename the current GUP to old_get_user_pages():

#ifdef CONFIG_MIGRATE_ISOLATE
int get_user_pages_non_movable()
{
	..
	old_get_user_pages()
	..
}

int get_user_pages()
{
	return get_user_pages_non_movable();
}
#else
int get_user_pages()
{
	return old_get_user_pages()
}
#endif

> thanks,
> linfeng

--
Kind regards,
Minchan Kim

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages 2013-02-05 5:25 ` Minchan Kim @ 2013-02-05 6:18 ` Lin Feng 2013-02-05 7:45 ` Minchan Kim 0 siblings, 1 reply; 34+ messages in thread From: Lin Feng @ 2013-02-05 6:18 UTC (permalink / raw) To: Minchan Kim Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel On 02/05/2013 01:25 PM, Minchan Kim wrote: > Hi Lin, > > On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote: >> Hi Minchan, >> >> On 02/05/2013 08:58 AM, Minchan Kim wrote: >>> Hello, >>> >>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: >>>> Currently get_user_pages() always tries to allocate pages from movable zone, >>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users >>>> of get_user_pages() is easy to pin user pages for a long time(for now we found >>>> that pages pinned as aio ring pages is such case), which is fatal for memory >>>> hotplug/remove framework. >>>> >>>> So the 1st patch introduces a new library function called >>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. >>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from >>>> non-movable zone via additional page migration. >>>> >>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by >>>> get_user_pages() via using the new function. It only works when configed with >>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). >>> >>> CMA has same issue but the problem is the driver developers or any subsystem >>> using GUP can't know their pages is in CMA area or not in advance. >>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? >>> Even some driver module in embedded side doesn't open their source code. 
>> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users >> of GUP, they will release the pinned pages immediately and to such users they should get >> a good performance, using the old style interface is a smart way. And we had better just >> deal with the cases we have to by using the new interface. > > Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages > immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power > by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know > in advance they will use pinned pages long time or release in short time because it depends > on some event like user's response which is very not predetermined. > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of > failing migration by page count and then, investigate they are really using GUP and it's > REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? > > Yes. it could be done but it would be rather trobulesome job. Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is a big job. If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the *non_movable* version of GUP. But in some server environment we want to keep the performance but also want to use hotremove feature in case. Maybe patch the place as we need is a trade off for such support. Mel also said in the last discussion: On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote: >> Well, that's a fairly low-level implementation detail. A more typical >> approach would be to add a new get_user_pages_non_movable() or such. >> That would probably have the same signature as get_user_pages(), with >> one additional argument. Then get_user_pages() becomes a one-line >> wrapper which passes in a particular value of that argument. 
>> > > That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE > allocations. That will impact THP availability by increasing the number > of MIGRATE_UNMOVABLE blocks that exist and it would hit every user -- > not just those that care about ZONE_MOVABLE. > > I'm likely to NAK such a patch if it's only about node hot-remove because > it's much more of a corner case than wanting to use THP. > > I would prefer if get_user_pages() checked if the page it was about to > pin was in ZONE_MOVABLE and if so, migrate it at that point before it's > pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability > if that's what they want. The CMA people might also want to take > advantage of this if the page happened to be in the MIGRATE_CMA > pageblock. > So it may not a good idea that we all fall into calling the *non_movable* version of GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think? > #ifdef CONFIG_MIGRATE_ISOLATE > > int get_user_pages_non_movable() > { > .. > old_get_user_pages() > .. > } > > int get_user_pages() > { > return get_user_pages_non_movable(); > } > #else > int get_user_pages() > { > return old_get_user_pages() > } > #endif thanks, linfeng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages 2013-02-05 6:18 ` Lin Feng @ 2013-02-05 7:45 ` Minchan Kim 2013-02-05 8:27 ` Lin Feng 0 siblings, 1 reply; 34+ messages in thread From: Minchan Kim @ 2013-02-05 7:45 UTC (permalink / raw) To: Lin Feng Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel On Tue, Feb 05, 2013 at 02:18:42PM +0800, Lin Feng wrote: > > > On 02/05/2013 01:25 PM, Minchan Kim wrote: > > Hi Lin, > > > > On Tue, Feb 05, 2013 at 12:42:48PM +0800, Lin Feng wrote: > >> Hi Minchan, > >> > >> On 02/05/2013 08:58 AM, Minchan Kim wrote: > >>> Hello, > >>> > >>> On Mon, Feb 04, 2013 at 06:04:06PM +0800, Lin Feng wrote: > >>>> Currently get_user_pages() always tries to allocate pages from movable zone, > >>>> as discussed in thread https://lkml.org/lkml/2012/11/29/69, in some case users > >>>> of get_user_pages() is easy to pin user pages for a long time(for now we found > >>>> that pages pinned as aio ring pages is such case), which is fatal for memory > >>>> hotplug/remove framework. > >>>> > >>>> So the 1st patch introduces a new library function called > >>>> get_user_pages_non_movable() to pin pages only from zone non-movable in memory. > >>>> It's a wrapper of get_user_pages() but it makes sure that all pages come from > >>>> non-movable zone via additional page migration. > >>>> > >>>> The 2nd patch gets around the aio ring pages can't be migrated bug caused by > >>>> get_user_pages() via using the new function. It only works when configed with > >>>> CONFIG_MEMORY_HOTREMOVE, otherwise it uses the old version of get_user_pages(). > >>> > >>> CMA has same issue but the problem is the driver developers or any subsystem > >>> using GUP can't know their pages is in CMA area or not in advance. 
> >>> So all of client of GUP should use GUP_NM to work them with CMA/MEMORY_HOTPLUG well? > >>> Even some driver module in embedded side doesn't open their source code. > >> Yes, it somehow depends on the users of GUP. In MEMORY_HOTPLUG case, as for most users > >> of GUP, they will release the pinned pages immediately and to such users they should get > >> a good performance, using the old style interface is a smart way. And we had better just > >> deal with the cases we have to by using the new interface. > > > > Hmm, I think you can't make sure most of user for MEMORY_HOTPLUG will release pinned pages > > immediately. Because MEMORY_HOTPLUG could be used for embedded system for reducing power > > by PASR and some drivers in embedded could use GUP anytime and anywhere. They can't know > > in advance they will use pinned pages long time or release in short time because it depends > > on some event like user's response which is very not predetermined. > > So for solving it, we can add some WARN_ON in CMA/MEMORY_HOTPLUG part just in case of > > failing migration by page count and then, investigate they are really using GUP and it's > > REALLY a culprit. If so, yell to them "Please use GUP_NM instead"? > > > > Yes. it could be done but it would be rather trobulesome job. > Yes WARN_ON may be easy while troubleshooting for finding the immigrate-able page is > a big job. > If we want to kill all the potential immigrate-able pages caused by GUP we'd better use the > *non_movable* version of GUP. > But in some server environment we want to keep the performance but also want to use hotremove > feature in case. Maybe patch the place as we need is a trade off for such support. > > Mel also said in the last discussion: > > On 11/30/2012 07:00 PM, Mel Gorman wrote:> On Thu, Nov 29, 2012 at 11:55:02PM -0800, Andrew Morton wrote: > >> Well, that's a fairly low-level implementation detail. A more typical > >> approach would be to add a new get_user_pages_non_movable() or such. 
> >> That would probably have the same signature as get_user_pages(), with > >> one additional argument. Then get_user_pages() becomes a one-line > >> wrapper which passes in a particular value of that argument. > >> > > > > That is going in the direction that all pinned pages become MIGRATE_UNMOVABLE > > allocations. That will impact THP availability by increasing the number > > of MIGRATE_UNMOVABLE blocks that exist and it would hit every user -- > > not just those that care about ZONE_MOVABLE. > > > > I'm likely to NAK such a patch if it's only about node hot-remove because > > it's much more of a corner case than wanting to use THP. > > > > I would prefer if get_user_pages() checked if the page it was about to > > pin was in ZONE_MOVABLE and if so, migrate it at that point before it's > > pinned. It'll be expensive but will guarantee ZONE_MOVABLE availability > > if that's what they want. The CMA people might also want to take > > advantage of this if the page happened to be in the MIGRATE_CMA > > pageblock. > > > > So it may not a good idea that we all fall into calling the *non_movable* version of > GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think? Frankly speaking, I can't understand Mel's comment. AFAIUC, he said GUP checks the page before get_page and if the page is movable zone, then migrate it out of movable zone and get_page again. That's exactly what I want. It doesn't introduce GUP_NM. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages
2013-02-05 7:45 ` Minchan Kim
@ 2013-02-05 8:27 ` Lin Feng
0 siblings, 0 replies; 34+ messages in thread
From: Lin Feng @ 2013-02-05 8:27 UTC (permalink / raw)
To: Minchan Kim
Cc: akpm, mgorman, bcrl, viro, khlebnikov, walken, kamezawa.hiroyu, riel, rientjes, isimatu.yasuaki, wency, laijs, jiang.liu, linux-mm, linux-aio, linux-fsdevel, linux-kernel

Hi Minchan,

On 02/05/2013 03:45 PM, Minchan Kim wrote:
>> So it may not a good idea that we all fall into calling the *non_movable* version of
>> GUP when CONFIG_MIGRATE_ISOLATE is on. What do you think?
> Frankly speaking, I can't understand Mel's comment.
> AFAIUC, he said GUP checks the page before get_page and if the page is movable zone,
> then migrate it out of movable zone and get_page again.
> That's exactly what I want. It doesn't introduce GUP_NM.

Whether a pin is long-term or not is unpredictable unless you know what the caller intends to do. We would have to check every time we call GUP, and GUP might need another parameter to teach itself to make the right decision; we have already got *8* parameters :(

thanks,
linfeng

^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2013-02-20 11:54 UTC | newest] Thread overview: 34+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-02-04 10:04 [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Lin Feng 2013-02-04 10:04 ` [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable() Lin Feng 2013-02-05 0:06 ` Andrew Morton 2013-02-05 0:18 ` Andrew Morton 2013-02-05 3:09 ` Lin Feng 2013-02-05 21:13 ` Andrew Morton 2013-02-05 11:57 ` Mel Gorman 2013-02-05 13:32 ` Mel Gorman 2013-02-19 13:37 ` Lin Feng 2013-02-20 2:34 ` Lin Feng 2013-02-20 2:44 ` Wanpeng Li 2013-02-20 2:44 ` Wanpeng Li [not found] ` <20130220024435.GA30208@hacker.(null)> 2013-02-20 2:59 ` Lin Feng 2013-02-20 9:58 ` Simon Jeons 2013-02-20 10:23 ` Lin Feng 2013-02-20 11:31 ` Simon Jeons 2013-02-20 11:54 ` Lin Feng 2013-02-06 2:26 ` Michel Lespinasse 2013-02-06 10:41 ` Mel Gorman 2013-02-18 10:34 ` Lin Feng 2013-02-18 15:17 ` Mel Gorman 2013-02-19 9:55 ` Lin Feng 2013-02-19 10:34 ` Mel Gorman 2013-02-04 10:04 ` [PATCH 2/2] fs/aio.c: use get_user_pages_non_movable() to pin ring pages when support memory hotremove Lin Feng 2013-02-04 15:18 ` Jeff Moyer 2013-02-04 23:02 ` Zach Brown 2013-02-05 5:35 ` Lin Feng 2013-02-05 5:06 ` Lin Feng 2013-02-05 0:58 ` [PATCH 0/2] mm: hotplug: implement non-movable version of get_user_pages() to kill long-time pin pages Minchan Kim 2013-02-05 4:42 ` Lin Feng 2013-02-05 5:25 ` Minchan Kim 2013-02-05 6:18 ` Lin Feng 2013-02-05 7:45 ` Minchan Kim 2013-02-05 8:27 ` Lin Feng