linux-mm.kvack.org archive mirror
* [RFC PATCH] mm: fix up zone's present_pages
@ 2012-11-19  8:45 Bob Liu
  2012-11-19  9:07 ` Jiang Liu
  2012-11-19  9:12 ` Wen Congyang
  0 siblings, 2 replies; 8+ messages in thread
From: Bob Liu @ 2012-11-19  8:45 UTC (permalink / raw)
  To: akpm
  Cc: jiang.liu, maciej.rutecki, chris2553, rjw, mgorman, minchan,
	kamezawa.hiroyu, mhocko, wency, daniel.vetter, rientjes,
	wujianguo, ptesarik, riel, linux-mm, Bob Liu

zone->present_pages should be:
spanned pages - absent pages - bootmem pages (including memmap pages),
but currently it is:
spanned pages - absent pages - memmap pages.
This does not consider whether the memmap pages are actually allocated from
the zone or not, which can cause problems now that memory hotplug has
recently been improved.

For example:
NUMA node 1 has ZONE_NORMAL and ZONE_MOVABLE, and its memmap and other
bootmem are allocated from ZONE_MOVABLE.
So ZONE_NORMAL's present_pages should be spanned pages - absent pages, but
currently memmap pages are subtracted as well, even though they are actually
allocated from ZONE_MOVABLE.
This is wrong, and when offlining all memory of this zone,
(zone->present_pages -= offline_pages) underflows below 0.
Since present_pages is of type unsigned long, the result is actually a very
large integer, which makes zone->watermark[WMARK_MIN] a very large integer
too (see setup_per_zone_wmarks()).
As a result, totalreserve_pages also becomes a very large integer, and memory
allocation eventually fails in __vm_enough_memory().
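
For illustration, a minimal user-space sketch of the underflow (the page
counts are made up; only the unsigned wrap-around matters):

#include <stdio.h>

int main(void)
{
	/* hypothetical zone where memmap was wrongly subtracted, so
	 * present_pages is smaller than the pages we try to offline */
	unsigned long present_pages = 1000;
	unsigned long offline_pages = 1500;

	present_pages -= offline_pages;	/* wraps, never goes negative */
	printf("present_pages = %lu\n", present_pages);
	/* prints 18446744073709551116 on 64-bit; watermarks and
	 * totalreserve_pages derived from it become huge as well */
	return 0;
}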

Related discussion:
http://lkml.org/lkml/2012/11/5/866
https://patchwork.kernel.org/patch/1346751/

Related patches in mmotm:
mm: fix-up zone present pages (7f1290f2f2a4d2c) (sometimes causes regressions)
mm: fix a regression with HIGHMEM (fe2cebd5a259eec) (Andrew had some feedback)

Jiang Liu has sent a series of patches to fix this issue by adding a
managed_pages field to struct zone:
[RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

But I think it is too complicated.
Mine is based on the two related patches already in mmotm (they need to be
reverted first).
It fixes the calculation of zone->present_pages by:
1. Resetting zone->present_pages to zero before free_all_bootmem(),
free_all_bootmem_node() and free_low_memory_core_early().
I think these already cover all paths on all architectures.

2. Whenever a page is freed to the buddy system in __free_pages_bootmem(),
increasing zone->present_pages accordingly.

Note this patch assumes that bootmem won't use memory above ZONE_HIGHMEM, so
only zones below ZONE_HIGHMEM are reset/fixed. If that is not the case, some
update is needed.
For ZONE_HIGHMEM, only its initial value is fixed, to
spanned_pages - absent_pages, in free_area_init_core().

Only some simple tests have been done so far.

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Bob Liu <lliubbo@gmail.com>
---
 include/linux/mm.h |    3 +++
 mm/bootmem.c       |    2 ++
 mm/nobootmem.c     |    1 +
 mm/page_alloc.c    |   49 +++++++++++++++++++++++++++++++++++++------------
 4 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b03cab..3b40eb6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1763,5 +1763,8 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
 static inline bool page_is_guard(struct page *page) { return false; }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
+extern void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat);
+extern void reset_lowmem_zone_present_pages(void);
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 26d057a..661775b 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -238,6 +238,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
 unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
 {
 	register_page_bootmem_info_node(pgdat);
+	reset_lowmem_zone_present_pages_pernode(pgdat);
 	return free_all_bootmem_core(pgdat->bdata);
 }
 
@@ -251,6 +252,7 @@ unsigned long __init free_all_bootmem(void)
 	unsigned long total_pages = 0;
 	bootmem_data_t *bdata;
 
+	reset_lowmem_zone_present_pages();
 	list_for_each_entry(bdata, &bdata_list, list)
 		total_pages += free_all_bootmem_core(bdata);
 
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bd82f6b..378d50a 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -126,6 +126,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
 	phys_addr_t start, end, size;
 	u64 i;
 
+	reset_lowmem_zone_present_pages();
 	for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
 		count += __free_memory_core(start, end);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07425a7..76d37f0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -735,6 +735,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
 {
 	unsigned int nr_pages = 1 << order;
 	unsigned int loop;
+	struct zone *zone;
 
 	prefetchw(page);
 	for (loop = 0; loop < nr_pages; loop++) {
@@ -748,6 +749,9 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
 
 	set_page_refcounted(page);
 	__free_pages(page, order);
+	zone = page_zone(page);
+	WARN_ON(!(is_normal(zone) || is_dma(zone) || is_dma32(zone)));
+	zone->present_pages += nr_pages;
 }
 
 #ifdef CONFIG_CMA
@@ -4547,18 +4551,20 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		 * is used by this zone for memmap. This affects the watermark
 		 * and per-cpu initialisations
 		 */
-		memmap_pages =
-			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
-		if (realsize >= memmap_pages) {
-			realsize -= memmap_pages;
-			if (memmap_pages)
-				printk(KERN_DEBUG
-				       "  %s zone: %lu pages used for memmap\n",
-				       zone_names[j], memmap_pages);
-		} else
-			printk(KERN_WARNING
-				"  %s zone: %lu pages exceeds realsize %lu\n",
-				zone_names[j], memmap_pages, realsize);
+		if (j < ZONE_HIGHMEM) {
+			memmap_pages =
+				PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
+			if (realsize >= memmap_pages) {
+				realsize -= memmap_pages;
+				if (memmap_pages)
+					printk(KERN_DEBUG
+							"  %s zone: %lu pages used for memmap\n",
+							zone_names[j], memmap_pages);
+			} else
+				printk(KERN_WARNING
+						"  %s zone: %lu pages exceeds realsize %lu\n",
+						zone_names[j], memmap_pages, realsize);
+		}
 
 		/* Account for reserved pages */
 		if (j == 0 && realsize > dma_reserve) {
@@ -6143,3 +6149,22 @@ void dump_page(struct page *page)
 	dump_page_flags(page->flags);
 	mem_cgroup_print_bad_page(page);
 }
+
+/* reset zone->present_pages to 0 for zones below ZONE_HIGHMEM */
+void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
+{
+	int i;
+	struct zone *z;
+	for (i = 0; i < ZONE_HIGHMEM; i++) {
+		z = pgdat->node_zones + i;
+		z->present_pages = 0;
+	}
+}
+
+void reset_lowmem_zone_present_pages(void)
+{
+	int nid;
+
+	for_each_node_state(nid, N_HIGH_MEMORY)
+		reset_lowmem_zone_present_pages_pernode(NODE_DATA(nid));
+}
-- 
1.7.9.5



* Re: [RFC PATCH] mm: fix up zone's present_pages
  2012-11-19  8:45 [RFC PATCH] mm: fix up zone's present_pages Bob Liu
@ 2012-11-19  9:07 ` Jiang Liu
  2012-11-20  7:44   ` Bob Liu
  2012-11-19  9:12 ` Wen Congyang
  1 sibling, 1 reply; 8+ messages in thread
From: Jiang Liu @ 2012-11-19  9:07 UTC (permalink / raw)
  To: Bob Liu
  Cc: akpm, maciej.rutecki, chris2553, rjw, mgorman, minchan,
	kamezawa.hiroyu, mhocko, wency, daniel.vetter, rientjes,
	wujianguo, ptesarik, riel, linux-mm

On 2012-11-19 16:45, Bob Liu wrote:
> zone->present_pages shoule be:
> spanned pages - absent pages - bootmem pages(including memmap pages),
> but now it's:
> spanned pages - absent pages - memmap pages.
> And it didn't consider whether the memmap pages is actully allocated from the
> zone or not which may cause problem when memory hotplug is improved recently.
> 
> For example:
> numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem
> allocated from ZONE_MOVABLE.
> So ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now
> it also minus memmap pages, which are actually allocated from ZONE_MOVABLE.
> This is wrong and when offlining all memory of this zone:
> (zone->present_pages -= offline_pages) will less than 0.
> Since present_pages is unsigned long type, that is actually a very large
> integer which will cause zone->watermark[WMARK_MIN] becomes a large
> integer too(see setup_per_zone_wmarks()).
> As a result, totalreserve_pages become a large integer also and finally memory
> allocating will fail in __vm_enough_memory().
> 
> Related discuss:
> http://lkml.org/lkml/2012/11/5/866
> https://patchwork.kernel.org/patch/1346751/
> 
> Related patches in mmotm:
> mm: fix-up zone present pages(7f1290f2f2a4d2c) (sometimes cause egression)
> mm: fix a regression with HIGHMEM(fe2cebd5a259eec) (Andrew have some feedback)
> 
> Jiang Liu have sent a series patches to fix this issue by adding a
> managed_pages area to zone struct:
> [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
> 
> But i think it's too complicated.
> Mine is based on the two related patches already in mmotm(need to revert them
> first)
> It fix the calculation of zone->present_pages by:
> 1. Reset the zone->present_pages to zero before
> free_all_bootmem(),free_all_bootmem_node() and free_low_memory_core_early().
> I think these should already included all path in all arch.
> 
> 2. If there is a page freed to buddy system in __free_pages_bootmem(),
> add zone->present_pages accrodingly.
> 
> Note this patch assumes that bootmem won't use memory above ZONE_HIGHMEM, so
> only zones below ZONE_HIGHMEM are reset/fixed. If not, some update is needed.
> For ZONE_HIGHMEM, only fix it's init value to:
> panned_pages - absent_pages in free_area_init_core().
> 
> Only did some simple test currently.
Hi Bob,
	Great to know that you are working on this issue too.
Originally I thought about reusing zone->present_pages.
Later I proposed adding the new field "managed_pages" because it lets
us know how many pages in the zone are not managed by the buddy
system (present_pages - managed_pages). This may help the ongoing
memory power management work from Srivatsa S. Bhat, because we can't
put memory ranges into a low power state if there are unmanaged
pages.
Also, pgdat->node_present_pages = spanned_pages - absent_pages, so it
would be better to stay consistent with node_present_pages by setting
zone->present_pages = spanned_pages - absent_pages.
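
For illustration, a tiny sketch of the relationship described above
(managed_pages is the field the RFT series would add; it does not exist
in the current tree):

/*
 * Hypothetical helper, assuming struct zone gains the managed_pages
 * field from the RFT series.  With present_pages kept equal to
 * spanned_pages - absent_pages (consistent with node_present_pages),
 * the difference below is the number of pages in the zone that the
 * buddy allocator never manages (memmap and other bootmem data).
 */
static inline unsigned long zone_unmanaged_pages(struct zone *zone)
{
	return zone->present_pages - zone->managed_pages;
}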
	Thanks!
	Gerry

> 
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
> Signed-off-by: Bob Liu <lliubbo@gmail.com>
> ---
>  include/linux/mm.h |    3 +++
>  mm/bootmem.c       |    2 ++
>  mm/nobootmem.c     |    1 +
>  mm/page_alloc.c    |   49 +++++++++++++++++++++++++++++++++++++------------
>  4 files changed, 43 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7b03cab..3b40eb6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1763,5 +1763,8 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
>  static inline bool page_is_guard(struct page *page) { return false; }
>  #endif /* CONFIG_DEBUG_PAGEALLOC */
>  
> +extern void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat);
> +extern void reset_lowmem_zone_present_pages(void);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> diff --git a/mm/bootmem.c b/mm/bootmem.c
> index 26d057a..661775b 100644
> --- a/mm/bootmem.c
> +++ b/mm/bootmem.c
> @@ -238,6 +238,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
>  unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
>  {
>  	register_page_bootmem_info_node(pgdat);
> +	reset_lowmem_zone_present_pages_pernode(pgdat);
>  	return free_all_bootmem_core(pgdat->bdata);
>  }
>  
> @@ -251,6 +252,7 @@ unsigned long __init free_all_bootmem(void)
>  	unsigned long total_pages = 0;
>  	bootmem_data_t *bdata;
>  
> +	reset_lowmem_zone_present_pages();
>  	list_for_each_entry(bdata, &bdata_list, list)
>  		total_pages += free_all_bootmem_core(bdata);
>  
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index bd82f6b..378d50a 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -126,6 +126,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
>  	phys_addr_t start, end, size;
>  	u64 i;
>  
> +	reset_lowmem_zone_present_pages();
>  	for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
>  		count += __free_memory_core(start, end);
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 07425a7..76d37f0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -735,6 +735,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>  {
>  	unsigned int nr_pages = 1 << order;
>  	unsigned int loop;
> +	struct zone *zone;
>  
>  	prefetchw(page);
>  	for (loop = 0; loop < nr_pages; loop++) {
> @@ -748,6 +749,9 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>  
>  	set_page_refcounted(page);
>  	__free_pages(page, order);
> +	zone = page_zone(page);
> +	WARN_ON(!(is_normal(zone) || is_dma(zone) || is_dma32(zone)));
> +	zone->present_pages += nr_pages;
>  }
>  
>  #ifdef CONFIG_CMA
> @@ -4547,18 +4551,20 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  		 * is used by this zone for memmap. This affects the watermark
>  		 * and per-cpu initialisations
>  		 */
> -		memmap_pages =
> -			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
> -		if (realsize >= memmap_pages) {
> -			realsize -= memmap_pages;
> -			if (memmap_pages)
> -				printk(KERN_DEBUG
> -				       "  %s zone: %lu pages used for memmap\n",
> -				       zone_names[j], memmap_pages);
> -		} else
> -			printk(KERN_WARNING
> -				"  %s zone: %lu pages exceeds realsize %lu\n",
> -				zone_names[j], memmap_pages, realsize);
> +		if (j < ZONE_HIGHMEM) {
> +			memmap_pages =
> +				PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
> +			if (realsize >= memmap_pages) {
> +				realsize -= memmap_pages;
> +				if (memmap_pages)
> +					printk(KERN_DEBUG
> +							"  %s zone: %lu pages used for memmap\n",
> +							zone_names[j], memmap_pages);
> +			} else
> +				printk(KERN_WARNING
> +						"  %s zone: %lu pages exceeds realsize %lu\n",
> +						zone_names[j], memmap_pages, realsize);
> +		}
>  
>  		/* Account for reserved pages */
>  		if (j == 0 && realsize > dma_reserve) {
> @@ -6143,3 +6149,22 @@ void dump_page(struct page *page)
>  	dump_page_flags(page->flags);
>  	mem_cgroup_print_bad_page(page);
>  }
> +
> +/* reset zone->present_pages to 0 for zones below ZONE_HIGHMEM */
> +void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
> +{
> +	int i;
> +	struct zone *z;
> +	for (i = 0; i < ZONE_HIGHMEM; i++) {
> +		z = pgdat->node_zones + i;
> +		z->present_pages = 0;
> +	}
> +}
> +
> +void reset_lowmem_zone_present_pages(void)
> +{
> +	int nid;
> +
> +	for_each_node_state(nid, N_HIGH_MEMORY)
> +		reset_lowmem_zone_present_pages_pernode(NODE_DATA(nid));
> +}



* Re: [RFC PATCH] mm: fix up zone's present_pages
  2012-11-19  9:12 ` Wen Congyang
@ 2012-11-19  9:11   ` Jiang Liu
  2012-11-20  3:47     ` Jaegeuk Hanse
  0 siblings, 1 reply; 8+ messages in thread
From: Jiang Liu @ 2012-11-19  9:11 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Bob Liu, akpm, maciej.rutecki, chris2553, rjw, mgorman, minchan,
	kamezawa.hiroyu, mhocko, daniel.vetter, rientjes, wujianguo,
	ptesarik, riel, linux-mm, lai jiangshan

On 2012-11-19 17:12, Wen Congyang wrote:
> At 11/19/2012 04:45 PM, Bob Liu Wrote:
>> zone->present_pages shoule be:
>> spanned pages - absent pages - bootmem pages(including memmap pages),
>> but now it's:
>> spanned pages - absent pages - memmap pages.
>> And it didn't consider whether the memmap pages is actully allocated from the
>> zone or not which may cause problem when memory hotplug is improved recently.
>>
>> For example:
>> numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem
>> allocated from ZONE_MOVABLE.
>> So ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now
>> it also minus memmap pages, which are actually allocated from ZONE_MOVABLE.
>> This is wrong and when offlining all memory of this zone:
>> (zone->present_pages -= offline_pages) will less than 0.
>> Since present_pages is unsigned long type, that is actually a very large
>> integer which will cause zone->watermark[WMARK_MIN] becomes a large
>> integer too(see setup_per_zone_wmarks()).
>> As a result, totalreserve_pages become a large integer also and finally memory
>> allocating will fail in __vm_enough_memory().
>>
>> Related discuss:
>> http://lkml.org/lkml/2012/11/5/866
>> https://patchwork.kernel.org/patch/1346751/
>>
>> Related patches in mmotm:
>> mm: fix-up zone present pages(7f1290f2f2a4d2c) (sometimes cause egression)
>> mm: fix a regression with HIGHMEM(fe2cebd5a259eec) (Andrew have some feedback)
>>
>> Jiang Liu have sent a series patches to fix this issue by adding a
>> managed_pages area to zone struct:
>> [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
>>
>> But i think it's too complicated.
>> Mine is based on the two related patches already in mmotm(need to revert them
>> first)
>> It fix the calculation of zone->present_pages by:
>> 1. Reset the zone->present_pages to zero before
>> free_all_bootmem(),free_all_bootmem_node() and free_low_memory_core_early().
>> I think these should already included all path in all arch.
>>
>> 2. If there is a page freed to buddy system in __free_pages_bootmem(),
>> add zone->present_pages accrodingly.
>>
>> Note this patch assumes that bootmem won't use memory above ZONE_HIGHMEM, so
> 
> Hmm, on x86_64 box, bootmem uses the memory in ZONE_MOVABLE and
> ZONE_MOVABLE > ZONE_HIGHMEM.
That's an issue; I'm trying to avoid allocating bootmem from ZONE_MOVABLE.
And is_highmem(zone) or is_highmem_idx(zone_idx) can safely be used to
distinguish highmem/movable zones.
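
Just to illustrate, a rough, untested sketch (not part of the posted
patch) of a reset helper keyed off the zone type via is_highmem_idx()
rather than relying on "index < ZONE_HIGHMEM":

/*
 * Sketch only: skip highmem and movable zones explicitly, so the loop
 * also covers ZONE_NORMAL when CONFIG_HIGHMEM is not set and never
 * touches the highmem/movable zones.
 */
void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
{
	enum zone_type i;

	for (i = 0; i < MAX_NR_ZONES; i++) {
		if (i == ZONE_MOVABLE || is_highmem_idx(i))
			continue;
		pgdat->node_zones[i].present_pages = 0;
	}
}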

>> only zones below ZONE_HIGHMEM are reset/fixed. If not, some update is needed.
>> For ZONE_HIGHMEM, only fix it's init value to:
>> panned_pages - absent_pages in free_area_init_core().
>>
>> Only did some simple test currently.
>>
>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
>> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
>> Signed-off-by: Bob Liu <lliubbo@gmail.com>
>> ---
>>  include/linux/mm.h |    3 +++
>>  mm/bootmem.c       |    2 ++
>>  mm/nobootmem.c     |    1 +
>>  mm/page_alloc.c    |   49 +++++++++++++++++++++++++++++++++++++------------
>>  4 files changed, 43 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 7b03cab..3b40eb6 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1763,5 +1763,8 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
>>  static inline bool page_is_guard(struct page *page) { return false; }
>>  #endif /* CONFIG_DEBUG_PAGEALLOC */
>>  
>> +extern void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat);
>> +extern void reset_lowmem_zone_present_pages(void);
>> +
>>  #endif /* __KERNEL__ */
>>  #endif /* _LINUX_MM_H */
>> diff --git a/mm/bootmem.c b/mm/bootmem.c
>> index 26d057a..661775b 100644
>> --- a/mm/bootmem.c
>> +++ b/mm/bootmem.c
>> @@ -238,6 +238,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
>>  unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
>>  {
>>  	register_page_bootmem_info_node(pgdat);
>> +	reset_lowmem_zone_present_pages_pernode(pgdat);
>>  	return free_all_bootmem_core(pgdat->bdata);
>>  }
>>  
>> @@ -251,6 +252,7 @@ unsigned long __init free_all_bootmem(void)
>>  	unsigned long total_pages = 0;
>>  	bootmem_data_t *bdata;
>>  
>> +	reset_lowmem_zone_present_pages();
>>  	list_for_each_entry(bdata, &bdata_list, list)
>>  		total_pages += free_all_bootmem_core(bdata);
>>  
>> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
>> index bd82f6b..378d50a 100644
>> --- a/mm/nobootmem.c
>> +++ b/mm/nobootmem.c
>> @@ -126,6 +126,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
>>  	phys_addr_t start, end, size;
>>  	u64 i;
>>  
>> +	reset_lowmem_zone_present_pages();
>>  	for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
>>  		count += __free_memory_core(start, end);
>>  
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 07425a7..76d37f0 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -735,6 +735,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>>  {
>>  	unsigned int nr_pages = 1 << order;
>>  	unsigned int loop;
>> +	struct zone *zone;
>>  
>>  	prefetchw(page);
>>  	for (loop = 0; loop < nr_pages; loop++) {
>> @@ -748,6 +749,9 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>>  
>>  	set_page_refcounted(page);
>>  	__free_pages(page, order);
>> +	zone = page_zone(page);
>> +	WARN_ON(!(is_normal(zone) || is_dma(zone) || is_dma32(zone)));
>> +	zone->present_pages += nr_pages;
>>  }
>>  
>>  #ifdef CONFIG_CMA
>> @@ -4547,18 +4551,20 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>>  		 * is used by this zone for memmap. This affects the watermark
>>  		 * and per-cpu initialisations
>>  		 */
>> -		memmap_pages =
>> -			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>> -		if (realsize >= memmap_pages) {
>> -			realsize -= memmap_pages;
>> -			if (memmap_pages)
>> -				printk(KERN_DEBUG
>> -				       "  %s zone: %lu pages used for memmap\n",
>> -				       zone_names[j], memmap_pages);
>> -		} else
>> -			printk(KERN_WARNING
>> -				"  %s zone: %lu pages exceeds realsize %lu\n",
>> -				zone_names[j], memmap_pages, realsize);
>> +		if (j < ZONE_HIGHMEM) {
>> +			memmap_pages =
>> +				PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>> +			if (realsize >= memmap_pages) {
>> +				realsize -= memmap_pages;
>> +				if (memmap_pages)
>> +					printk(KERN_DEBUG
>> +							"  %s zone: %lu pages used for memmap\n",
>> +							zone_names[j], memmap_pages);
>> +			} else
>> +				printk(KERN_WARNING
>> +						"  %s zone: %lu pages exceeds realsize %lu\n",
>> +						zone_names[j], memmap_pages, realsize);
>> +		}
>>  
>>  		/* Account for reserved pages */
>>  		if (j == 0 && realsize > dma_reserve) {
>> @@ -6143,3 +6149,22 @@ void dump_page(struct page *page)
>>  	dump_page_flags(page->flags);
>>  	mem_cgroup_print_bad_page(page);
>>  }
>> +
>> +/* reset zone->present_pages to 0 for zones below ZONE_HIGHMEM */
>> +void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
>> +{
>> +	int i;
>> +	struct zone *z;
>> +	for (i = 0; i < ZONE_HIGHMEM; i++) {
> 
> And if CONFIG_HIGHMEM is no, ZONE_NORMAL == ZONE_HIGHMEM.
> 
> So, you don't reset ZONE_NORMAL here.
> 
>> +		z = pgdat->node_zones + i;
>> +		z->present_pages = 0;
>> +	}
>> +}
>> +
>> +void reset_lowmem_zone_present_pages(void)
>> +{
>> +	int nid;
>> +
>> +	for_each_node_state(nid, N_HIGH_MEMORY)
>> +		reset_lowmem_zone_present_pages_pernode(NODE_DATA(nid));
>> +}
> 
> 
> .
> 



* Re: [RFC PATCH] mm: fix up zone's present_pages
  2012-11-19  8:45 [RFC PATCH] mm: fix up zone's present_pages Bob Liu
  2012-11-19  9:07 ` Jiang Liu
@ 2012-11-19  9:12 ` Wen Congyang
  2012-11-19  9:11   ` Jiang Liu
  1 sibling, 1 reply; 8+ messages in thread
From: Wen Congyang @ 2012-11-19  9:12 UTC (permalink / raw)
  To: Bob Liu
  Cc: akpm, jiang.liu, maciej.rutecki, chris2553, rjw, mgorman, minchan,
	kamezawa.hiroyu, mhocko, daniel.vetter, rientjes, wujianguo,
	ptesarik, riel, linux-mm, lai jiangshan

At 11/19/2012 04:45 PM, Bob Liu Wrote:
> zone->present_pages shoule be:
> spanned pages - absent pages - bootmem pages(including memmap pages),
> but now it's:
> spanned pages - absent pages - memmap pages.
> And it didn't consider whether the memmap pages is actully allocated from the
> zone or not which may cause problem when memory hotplug is improved recently.
> 
> For example:
> numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem
> allocated from ZONE_MOVABLE.
> So ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now
> it also minus memmap pages, which are actually allocated from ZONE_MOVABLE.
> This is wrong and when offlining all memory of this zone:
> (zone->present_pages -= offline_pages) will less than 0.
> Since present_pages is unsigned long type, that is actually a very large
> integer which will cause zone->watermark[WMARK_MIN] becomes a large
> integer too(see setup_per_zone_wmarks()).
> As a result, totalreserve_pages become a large integer also and finally memory
> allocating will fail in __vm_enough_memory().
> 
> Related discuss:
> http://lkml.org/lkml/2012/11/5/866
> https://patchwork.kernel.org/patch/1346751/
> 
> Related patches in mmotm:
> mm: fix-up zone present pages(7f1290f2f2a4d2c) (sometimes cause egression)
> mm: fix a regression with HIGHMEM(fe2cebd5a259eec) (Andrew have some feedback)
> 
> Jiang Liu have sent a series patches to fix this issue by adding a
> managed_pages area to zone struct:
> [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
> 
> But i think it's too complicated.
> Mine is based on the two related patches already in mmotm(need to revert them
> first)
> It fix the calculation of zone->present_pages by:
> 1. Reset the zone->present_pages to zero before
> free_all_bootmem(),free_all_bootmem_node() and free_low_memory_core_early().
> I think these should already included all path in all arch.
> 
> 2. If there is a page freed to buddy system in __free_pages_bootmem(),
> add zone->present_pages accrodingly.
> 
> Note this patch assumes that bootmem won't use memory above ZONE_HIGHMEM, so

Hmm, on an x86_64 box, bootmem uses memory in ZONE_MOVABLE, and
ZONE_MOVABLE > ZONE_HIGHMEM.

> only zones below ZONE_HIGHMEM are reset/fixed. If not, some update is needed.
> For ZONE_HIGHMEM, only fix it's init value to:
> panned_pages - absent_pages in free_area_init_core().
> 
> Only did some simple test currently.
> 
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
> Signed-off-by: Bob Liu <lliubbo@gmail.com>
> ---
>  include/linux/mm.h |    3 +++
>  mm/bootmem.c       |    2 ++
>  mm/nobootmem.c     |    1 +
>  mm/page_alloc.c    |   49 +++++++++++++++++++++++++++++++++++++------------
>  4 files changed, 43 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7b03cab..3b40eb6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1763,5 +1763,8 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
>  static inline bool page_is_guard(struct page *page) { return false; }
>  #endif /* CONFIG_DEBUG_PAGEALLOC */
>  
> +extern void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat);
> +extern void reset_lowmem_zone_present_pages(void);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> diff --git a/mm/bootmem.c b/mm/bootmem.c
> index 26d057a..661775b 100644
> --- a/mm/bootmem.c
> +++ b/mm/bootmem.c
> @@ -238,6 +238,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
>  unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
>  {
>  	register_page_bootmem_info_node(pgdat);
> +	reset_lowmem_zone_present_pages_pernode(pgdat);
>  	return free_all_bootmem_core(pgdat->bdata);
>  }
>  
> @@ -251,6 +252,7 @@ unsigned long __init free_all_bootmem(void)
>  	unsigned long total_pages = 0;
>  	bootmem_data_t *bdata;
>  
> +	reset_lowmem_zone_present_pages();
>  	list_for_each_entry(bdata, &bdata_list, list)
>  		total_pages += free_all_bootmem_core(bdata);
>  
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index bd82f6b..378d50a 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -126,6 +126,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
>  	phys_addr_t start, end, size;
>  	u64 i;
>  
> +	reset_lowmem_zone_present_pages();
>  	for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
>  		count += __free_memory_core(start, end);
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 07425a7..76d37f0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -735,6 +735,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>  {
>  	unsigned int nr_pages = 1 << order;
>  	unsigned int loop;
> +	struct zone *zone;
>  
>  	prefetchw(page);
>  	for (loop = 0; loop < nr_pages; loop++) {
> @@ -748,6 +749,9 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>  
>  	set_page_refcounted(page);
>  	__free_pages(page, order);
> +	zone = page_zone(page);
> +	WARN_ON(!(is_normal(zone) || is_dma(zone) || is_dma32(zone)));
> +	zone->present_pages += nr_pages;
>  }
>  
>  #ifdef CONFIG_CMA
> @@ -4547,18 +4551,20 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  		 * is used by this zone for memmap. This affects the watermark
>  		 * and per-cpu initialisations
>  		 */
> -		memmap_pages =
> -			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
> -		if (realsize >= memmap_pages) {
> -			realsize -= memmap_pages;
> -			if (memmap_pages)
> -				printk(KERN_DEBUG
> -				       "  %s zone: %lu pages used for memmap\n",
> -				       zone_names[j], memmap_pages);
> -		} else
> -			printk(KERN_WARNING
> -				"  %s zone: %lu pages exceeds realsize %lu\n",
> -				zone_names[j], memmap_pages, realsize);
> +		if (j < ZONE_HIGHMEM) {
> +			memmap_pages =
> +				PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
> +			if (realsize >= memmap_pages) {
> +				realsize -= memmap_pages;
> +				if (memmap_pages)
> +					printk(KERN_DEBUG
> +							"  %s zone: %lu pages used for memmap\n",
> +							zone_names[j], memmap_pages);
> +			} else
> +				printk(KERN_WARNING
> +						"  %s zone: %lu pages exceeds realsize %lu\n",
> +						zone_names[j], memmap_pages, realsize);
> +		}
>  
>  		/* Account for reserved pages */
>  		if (j == 0 && realsize > dma_reserve) {
> @@ -6143,3 +6149,22 @@ void dump_page(struct page *page)
>  	dump_page_flags(page->flags);
>  	mem_cgroup_print_bad_page(page);
>  }
> +
> +/* reset zone->present_pages to 0 for zones below ZONE_HIGHMEM */
> +void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
> +{
> +	int i;
> +	struct zone *z;
> +	for (i = 0; i < ZONE_HIGHMEM; i++) {

And if CONFIG_HIGHMEM is not set, ZONE_NORMAL == ZONE_HIGHMEM.

So you don't reset ZONE_NORMAL here.

> +		z = pgdat->node_zones + i;
> +		z->present_pages = 0;
> +	}
> +}
> +
> +void reset_lowmem_zone_present_pages(void)
> +{
> +	int nid;
> +
> +	for_each_node_state(nid, N_HIGH_MEMORY)
> +		reset_lowmem_zone_present_pages_pernode(NODE_DATA(nid));
> +}


* Re: [RFC PATCH] mm: fix up zone's present_pages
  2012-11-19  9:11   ` Jiang Liu
@ 2012-11-20  3:47     ` Jaegeuk Hanse
  2012-11-20  3:59       ` Wen Congyang
  0 siblings, 1 reply; 8+ messages in thread
From: Jaegeuk Hanse @ 2012-11-20  3:47 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Wen Congyang, Bob Liu, akpm, maciej.rutecki, chris2553, rjw,
	mgorman, minchan, kamezawa.hiroyu, mhocko, daniel.vetter,
	rientjes, wujianguo, ptesarik, riel, linux-mm, lai jiangshan

On 11/19/2012 05:11 PM, Jiang Liu wrote:
> On 2012-11-19 17:12, Wen Congyang wrote:
>> At 11/19/2012 04:45 PM, Bob Liu Wrote:
>>> zone->present_pages shoule be:
>>> spanned pages - absent pages - bootmem pages(including memmap pages),
>>> but now it's:
>>> spanned pages - absent pages - memmap pages.
>>> And it didn't consider whether the memmap pages is actully allocated from the
>>> zone or not which may cause problem when memory hotplug is improved recently.
>>>
>>> For example:
>>> numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem
>>> allocated from ZONE_MOVABLE.
>>> So ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now
>>> it also minus memmap pages, which are actually allocated from ZONE_MOVABLE.
>>> This is wrong and when offlining all memory of this zone:
>>> (zone->present_pages -= offline_pages) will less than 0.
>>> Since present_pages is unsigned long type, that is actually a very large
>>> integer which will cause zone->watermark[WMARK_MIN] becomes a large
>>> integer too(see setup_per_zone_wmarks()).
>>> As a result, totalreserve_pages become a large integer also and finally memory
>>> allocating will fail in __vm_enough_memory().
>>>
>>> Related discuss:
>>> http://lkml.org/lkml/2012/11/5/866
>>> https://patchwork.kernel.org/patch/1346751/
>>>
>>> Related patches in mmotm:
>>> mm: fix-up zone present pages(7f1290f2f2a4d2c) (sometimes cause egression)
>>> mm: fix a regression with HIGHMEM(fe2cebd5a259eec) (Andrew have some feedback)
>>>
>>> Jiang Liu have sent a series patches to fix this issue by adding a
>>> managed_pages area to zone struct:
>>> [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
>>>
>>> But i think it's too complicated.
>>> Mine is based on the two related patches already in mmotm(need to revert them
>>> first)
>>> It fix the calculation of zone->present_pages by:
>>> 1. Reset the zone->present_pages to zero before
>>> free_all_bootmem(),free_all_bootmem_node() and free_low_memory_core_early().
>>> I think these should already included all path in all arch.
>>>
>>> 2. If there is a page freed to buddy system in __free_pages_bootmem(),
>>> add zone->present_pages accrodingly.
>>>
>>> Note this patch assumes that bootmem won't use memory above ZONE_HIGHMEM, so
>> Hmm, on x86_64 box, bootmem uses the memory in ZONE_MOVABLE and
>> ZONE_MOVABLE > ZONE_HIGHMEM.
> That's an issue, I'm trying to avoid allocating bootmem from ZONE_MOVABLE.
> And is_highmem(z) or is_highmem_idx(zoneidx) could be safely used to distinguish
> movable highmem zones.
Hi Jiang,

- I'm not sure why the bootmem mentioned above uses memory in
ZONE_MOVABLE. IIUC, the current nobootmem/memblock logic allocates
pages from the highest available memory down to the lowest to avoid
fragmentation. Is that correct?
- Why do we need to avoid allocating bootmem from ZONE_MOVABLE?

Regards,
Jaegeuk
>
>>> only zones below ZONE_HIGHMEM are reset/fixed. If not, some update is needed.
>>> For ZONE_HIGHMEM, only fix it's init value to:
>>> panned_pages - absent_pages in free_area_init_core().
>>>
>>> Only did some simple test currently.
>>>
>>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
>>> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
>>> Signed-off-by: Bob Liu <lliubbo@gmail.com>
>>> ---
>>>   include/linux/mm.h |    3 +++
>>>   mm/bootmem.c       |    2 ++
>>>   mm/nobootmem.c     |    1 +
>>>   mm/page_alloc.c    |   49 +++++++++++++++++++++++++++++++++++++------------
>>>   4 files changed, 43 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 7b03cab..3b40eb6 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1763,5 +1763,8 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
>>>   static inline bool page_is_guard(struct page *page) { return false; }
>>>   #endif /* CONFIG_DEBUG_PAGEALLOC */
>>>   
>>> +extern void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat);
>>> +extern void reset_lowmem_zone_present_pages(void);
>>> +
>>>   #endif /* __KERNEL__ */
>>>   #endif /* _LINUX_MM_H */
>>> diff --git a/mm/bootmem.c b/mm/bootmem.c
>>> index 26d057a..661775b 100644
>>> --- a/mm/bootmem.c
>>> +++ b/mm/bootmem.c
>>> @@ -238,6 +238,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
>>>   unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
>>>   {
>>>   	register_page_bootmem_info_node(pgdat);
>>> +	reset_lowmem_zone_present_pages_pernode(pgdat);
>>>   	return free_all_bootmem_core(pgdat->bdata);
>>>   }
>>>   
>>> @@ -251,6 +252,7 @@ unsigned long __init free_all_bootmem(void)
>>>   	unsigned long total_pages = 0;
>>>   	bootmem_data_t *bdata;
>>>   
>>> +	reset_lowmem_zone_present_pages();
>>>   	list_for_each_entry(bdata, &bdata_list, list)
>>>   		total_pages += free_all_bootmem_core(bdata);
>>>   
>>> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
>>> index bd82f6b..378d50a 100644
>>> --- a/mm/nobootmem.c
>>> +++ b/mm/nobootmem.c
>>> @@ -126,6 +126,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
>>>   	phys_addr_t start, end, size;
>>>   	u64 i;
>>>   
>>> +	reset_lowmem_zone_present_pages();
>>>   	for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
>>>   		count += __free_memory_core(start, end);
>>>   
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 07425a7..76d37f0 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -735,6 +735,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>>>   {
>>>   	unsigned int nr_pages = 1 << order;
>>>   	unsigned int loop;
>>> +	struct zone *zone;
>>>   
>>>   	prefetchw(page);
>>>   	for (loop = 0; loop < nr_pages; loop++) {
>>> @@ -748,6 +749,9 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>>>   
>>>   	set_page_refcounted(page);
>>>   	__free_pages(page, order);
>>> +	zone = page_zone(page);
>>> +	WARN_ON(!(is_normal(zone) || is_dma(zone) || is_dma32(zone)));
>>> +	zone->present_pages += nr_pages;
>>>   }
>>>   
>>>   #ifdef CONFIG_CMA
>>> @@ -4547,18 +4551,20 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>>>   		 * is used by this zone for memmap. This affects the watermark
>>>   		 * and per-cpu initialisations
>>>   		 */
>>> -		memmap_pages =
>>> -			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>>> -		if (realsize >= memmap_pages) {
>>> -			realsize -= memmap_pages;
>>> -			if (memmap_pages)
>>> -				printk(KERN_DEBUG
>>> -				       "  %s zone: %lu pages used for memmap\n",
>>> -				       zone_names[j], memmap_pages);
>>> -		} else
>>> -			printk(KERN_WARNING
>>> -				"  %s zone: %lu pages exceeds realsize %lu\n",
>>> -				zone_names[j], memmap_pages, realsize);
>>> +		if (j < ZONE_HIGHMEM) {
>>> +			memmap_pages =
>>> +				PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>>> +			if (realsize >= memmap_pages) {
>>> +				realsize -= memmap_pages;
>>> +				if (memmap_pages)
>>> +					printk(KERN_DEBUG
>>> +							"  %s zone: %lu pages used for memmap\n",
>>> +							zone_names[j], memmap_pages);
>>> +			} else
>>> +				printk(KERN_WARNING
>>> +						"  %s zone: %lu pages exceeds realsize %lu\n",
>>> +						zone_names[j], memmap_pages, realsize);
>>> +		}
>>>   
>>>   		/* Account for reserved pages */
>>>   		if (j == 0 && realsize > dma_reserve) {
>>> @@ -6143,3 +6149,22 @@ void dump_page(struct page *page)
>>>   	dump_page_flags(page->flags);
>>>   	mem_cgroup_print_bad_page(page);
>>>   }
>>> +
>>> +/* reset zone->present_pages to 0 for zones below ZONE_HIGHMEM */
>>> +void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
>>> +{
>>> +	int i;
>>> +	struct zone *z;
>>> +	for (i = 0; i < ZONE_HIGHMEM; i++) {
>> And if CONFIG_HIGHMEM is no, ZONE_NORMAL == ZONE_HIGHMEM.
>>
>> So, you don't reset ZONE_NORMAL here.
>>
>>> +		z = pgdat->node_zones + i;
>>> +		z->present_pages = 0;
>>> +	}
>>> +}
>>> +
>>> +void reset_lowmem_zone_present_pages(void)
>>> +{
>>> +	int nid;
>>> +
>>> +	for_each_node_state(nid, N_HIGH_MEMORY)
>>> +		reset_lowmem_zone_present_pages_pernode(NODE_DATA(nid));
>>> +}
>>
>> .
>>
>


* Re: [RFC PATCH] mm: fix up zone's present_pages
  2012-11-20  3:47     ` Jaegeuk Hanse
@ 2012-11-20  3:59       ` Wen Congyang
  2012-11-20  4:11         ` Jaegeuk Hanse
  0 siblings, 1 reply; 8+ messages in thread
From: Wen Congyang @ 2012-11-20  3:59 UTC (permalink / raw)
  To: Jaegeuk Hanse
  Cc: Jiang Liu, Bob Liu, akpm, maciej.rutecki, chris2553, rjw, mgorman,
	minchan, kamezawa.hiroyu, mhocko, daniel.vetter, rientjes,
	wujianguo, ptesarik, riel, linux-mm, lai jiangshan

At 11/20/2012 11:47 AM, Jaegeuk Hanse Wrote:
> On 11/19/2012 05:11 PM, Jiang Liu wrote:
>> On 2012-11-19 17:12, Wen Congyang wrote:
>>> At 11/19/2012 04:45 PM, Bob Liu Wrote:
>>>> zone->present_pages shoule be:
>>>> spanned pages - absent pages - bootmem pages(including memmap pages),
>>>> but now it's:
>>>> spanned pages - absent pages - memmap pages.
>>>> And it didn't consider whether the memmap pages is actully allocated
>>>> from the
>>>> zone or not which may cause problem when memory hotplug is improved
>>>> recently.
>>>>
>>>> For example:
>>>> numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other
>>>> bootmem
>>>> allocated from ZONE_MOVABLE.
>>>> So ZONE_NORMAL's present_pages should be spanned pages - absent
>>>> pages, but now
>>>> it also minus memmap pages, which are actually allocated from
>>>> ZONE_MOVABLE.
>>>> This is wrong and when offlining all memory of this zone:
>>>> (zone->present_pages -= offline_pages) will less than 0.
>>>> Since present_pages is unsigned long type, that is actually a very
>>>> large
>>>> integer which will cause zone->watermark[WMARK_MIN] becomes a large
>>>> integer too(see setup_per_zone_wmarks()).
>>>> As a result, totalreserve_pages become a large integer also and
>>>> finally memory
>>>> allocating will fail in __vm_enough_memory().
>>>>
>>>> Related discuss:
>>>> http://lkml.org/lkml/2012/11/5/866
>>>> https://patchwork.kernel.org/patch/1346751/
>>>>
>>>> Related patches in mmotm:
>>>> mm: fix-up zone present pages(7f1290f2f2a4d2c) (sometimes cause
>>>> egression)
>>>> mm: fix a regression with HIGHMEM(fe2cebd5a259eec) (Andrew have some
>>>> feedback)
>>>>
>>>> Jiang Liu have sent a series patches to fix this issue by adding a
>>>> managed_pages area to zone struct:
>>>> [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
>>>>
>>>> But i think it's too complicated.
>>>> Mine is based on the two related patches already in mmotm(need to
>>>> revert them
>>>> first)
>>>> It fix the calculation of zone->present_pages by:
>>>> 1. Reset the zone->present_pages to zero before
>>>> free_all_bootmem(),free_all_bootmem_node() and
>>>> free_low_memory_core_early().
>>>> I think these should already included all path in all arch.
>>>>
>>>> 2. If there is a page freed to buddy system in __free_pages_bootmem(),
>>>> add zone->present_pages accrodingly.
>>>>
>>>> Note this patch assumes that bootmem won't use memory above
>>>> ZONE_HIGHMEM, so
>>> Hmm, on x86_64 box, bootmem uses the memory in ZONE_MOVABLE and
>>> ZONE_MOVABLE > ZONE_HIGHMEM.
>> That's an issue, I'm trying to avoid allocating bootmem from
>> ZONE_MOVABLE.
>> And is_highmem(z) or is_highmem_idx(zoneidx) could be safely used to
>> distinguish
>> movable highmem zones.
> Hi Jiang,
> 
> - I'm not sure why the above mentioned bootmem use the memory in
> ZONE_MOVABLE, IIUR, because current nobootmem/memblock logic will alloc
> pages from highest available memory to lowest available memory to avoid
> fragmentation. Is it correct?
> - why need avoid allocating bootmem from ZONE_MOVABLE?

ZONE_MOVABLE means that the memory can be moved to another place, so
we can offline it. It is very useful for memory hotplug.

Thanks
Wen Congyang

> 
> Regards,
> Jaejeuk
>>
>>>> only zones below ZONE_HIGHMEM are reset/fixed. If not, some update
>>>> is needed.
>>>> For ZONE_HIGHMEM, only fix it's init value to:
>>>> panned_pages - absent_pages in free_area_init_core().
>>>>
>>>> Only did some simple test currently.
>>>>
>>>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
>>>> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
>>>> Signed-off-by: Bob Liu <lliubbo@gmail.com>
>>>> ---
>>>>   include/linux/mm.h |    3 +++
>>>>   mm/bootmem.c       |    2 ++
>>>>   mm/nobootmem.c     |    1 +
>>>>   mm/page_alloc.c    |   49
>>>> +++++++++++++++++++++++++++++++++++++------------
>>>>   4 files changed, 43 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 7b03cab..3b40eb6 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -1763,5 +1763,8 @@ static inline unsigned int
>>>> debug_guardpage_minorder(void) { return 0; }
>>>>   static inline bool page_is_guard(struct page *page) { return false; }
>>>>   #endif /* CONFIG_DEBUG_PAGEALLOC */
>>>>   +extern void reset_lowmem_zone_present_pages_pernode(pg_data_t
>>>> *pgdat);
>>>> +extern void reset_lowmem_zone_present_pages(void);
>>>> +
>>>>   #endif /* __KERNEL__ */
>>>>   #endif /* _LINUX_MM_H */
>>>> diff --git a/mm/bootmem.c b/mm/bootmem.c
>>>> index 26d057a..661775b 100644
>>>> --- a/mm/bootmem.c
>>>> +++ b/mm/bootmem.c
>>>> @@ -238,6 +238,7 @@ static unsigned long __init
>>>> free_all_bootmem_core(bootmem_data_t *bdata)
>>>>   unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
>>>>   {
>>>>       register_page_bootmem_info_node(pgdat);
>>>> +    reset_lowmem_zone_present_pages_pernode(pgdat);
>>>>       return free_all_bootmem_core(pgdat->bdata);
>>>>   }
>>>>   @@ -251,6 +252,7 @@ unsigned long __init free_all_bootmem(void)
>>>>       unsigned long total_pages = 0;
>>>>       bootmem_data_t *bdata;
>>>>   +    reset_lowmem_zone_present_pages();
>>>>       list_for_each_entry(bdata, &bdata_list, list)
>>>>           total_pages += free_all_bootmem_core(bdata);
>>>>   diff --git a/mm/nobootmem.c b/mm/nobootmem.c
>>>> index bd82f6b..378d50a 100644
>>>> --- a/mm/nobootmem.c
>>>> +++ b/mm/nobootmem.c
>>>> @@ -126,6 +126,7 @@ unsigned long __init
>>>> free_low_memory_core_early(int nodeid)
>>>>       phys_addr_t start, end, size;
>>>>       u64 i;
>>>>   +    reset_lowmem_zone_present_pages();
>>>>       for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
>>>>           count += __free_memory_core(start, end);
>>>>   diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 07425a7..76d37f0 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -735,6 +735,7 @@ void __meminit __free_pages_bootmem(struct page
>>>> *page, unsigned int order)
>>>>   {
>>>>       unsigned int nr_pages = 1 << order;
>>>>       unsigned int loop;
>>>> +    struct zone *zone;
>>>>         prefetchw(page);
>>>>       for (loop = 0; loop < nr_pages; loop++) {
>>>> @@ -748,6 +749,9 @@ void __meminit __free_pages_bootmem(struct page
>>>> *page, unsigned int order)
>>>>         set_page_refcounted(page);
>>>>       __free_pages(page, order);
>>>> +    zone = page_zone(page);
>>>> +    WARN_ON(!(is_normal(zone) || is_dma(zone) || is_dma32(zone)));
>>>> +    zone->present_pages += nr_pages;
>>>>   }
>>>>     #ifdef CONFIG_CMA
>>>> @@ -4547,18 +4551,20 @@ static void __paginginit
>>>> free_area_init_core(struct pglist_data *pgdat,
>>>>            * is used by this zone for memmap. This affects the
>>>> watermark
>>>>            * and per-cpu initialisations
>>>>            */
>>>> -        memmap_pages =
>>>> -            PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>>>> -        if (realsize >= memmap_pages) {
>>>> -            realsize -= memmap_pages;
>>>> -            if (memmap_pages)
>>>> -                printk(KERN_DEBUG
>>>> -                       "  %s zone: %lu pages used for memmap\n",
>>>> -                       zone_names[j], memmap_pages);
>>>> -        } else
>>>> -            printk(KERN_WARNING
>>>> -                "  %s zone: %lu pages exceeds realsize %lu\n",
>>>> -                zone_names[j], memmap_pages, realsize);
>>>> +        if (j < ZONE_HIGHMEM) {
>>>> +            memmap_pages =
>>>> +                PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>>>> +            if (realsize >= memmap_pages) {
>>>> +                realsize -= memmap_pages;
>>>> +                if (memmap_pages)
>>>> +                    printk(KERN_DEBUG
>>>> +                            "  %s zone: %lu pages used for memmap\n",
>>>> +                            zone_names[j], memmap_pages);
>>>> +            } else
>>>> +                printk(KERN_WARNING
>>>> +                        "  %s zone: %lu pages exceeds realsize %lu\n",
>>>> +                        zone_names[j], memmap_pages, realsize);
>>>> +        }
>>>>             /* Account for reserved pages */
>>>>           if (j == 0 && realsize > dma_reserve) {
>>>> @@ -6143,3 +6149,22 @@ void dump_page(struct page *page)
>>>>       dump_page_flags(page->flags);
>>>>       mem_cgroup_print_bad_page(page);
>>>>   }
>>>> +
>>>> +/* reset zone->present_pages to 0 for zones below ZONE_HIGHMEM */
>>>> +void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
>>>> +{
>>>> +    int i;
>>>> +    struct zone *z;
>>>> +    for (i = 0; i < ZONE_HIGHMEM; i++) {
>>> And if CONFIG_HIGHMEM is no, ZONE_NORMAL == ZONE_HIGHMEM.
>>>
>>> So, you don't reset ZONE_NORMAL here.
>>>
>>>> +        z = pgdat->node_zones + i;
>>>> +        z->present_pages = 0;
>>>> +    }
>>>> +}
>>>> +
>>>> +void reset_lowmem_zone_present_pages(void)
>>>> +{
>>>> +    int nid;
>>>> +
>>>> +    for_each_node_state(nid, N_HIGH_MEMORY)
>>>> +        reset_lowmem_zone_present_pages_pernode(NODE_DATA(nid));
>>>> +}
>>>
>>> .
>>>
>>
> 
> 


* Re: [RFC PATCH] mm: fix up zone's present_pages
  2012-11-20  3:59       ` Wen Congyang
@ 2012-11-20  4:11         ` Jaegeuk Hanse
  0 siblings, 0 replies; 8+ messages in thread
From: Jaegeuk Hanse @ 2012-11-20  4:11 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Jiang Liu, Bob Liu, akpm, maciej.rutecki, chris2553, rjw, mgorman,
	minchan, kamezawa.hiroyu, mhocko, daniel.vetter, rientjes,
	wujianguo, ptesarik, riel, linux-mm, lai jiangshan

On 11/20/2012 11:59 AM, Wen Congyang wrote:
> At 11/20/2012 11:47 AM, Jaegeuk Hanse Wrote:
>> On 11/19/2012 05:11 PM, Jiang Liu wrote:
>>> On 2012-11-19 17:12, Wen Congyang wrote:
>>>> At 11/19/2012 04:45 PM, Bob Liu Wrote:
>>>>> zone->present_pages shoule be:
>>>>> spanned pages - absent pages - bootmem pages(including memmap pages),
>>>>> but now it's:
>>>>> spanned pages - absent pages - memmap pages.
>>>>> And it didn't consider whether the memmap pages is actully allocated
>>>>> from the
>>>>> zone or not which may cause problem when memory hotplug is improved
>>>>> recently.
>>>>>
>>>>> For example:
>>>>> numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other
>>>>> bootmem
>>>>> allocated from ZONE_MOVABLE.
>>>>> So ZONE_NORMAL's present_pages should be spanned pages - absent
>>>>> pages, but now
>>>>> it also minus memmap pages, which are actually allocated from
>>>>> ZONE_MOVABLE.
>>>>> This is wrong and when offlining all memory of this zone:
>>>>> (zone->present_pages -= offline_pages) will less than 0.
>>>>> Since present_pages is unsigned long type, that is actually a very
>>>>> large
>>>>> integer which will cause zone->watermark[WMARK_MIN] becomes a large
>>>>> integer too(see setup_per_zone_wmarks()).
>>>>> As a result, totalreserve_pages become a large integer also and
>>>>> finally memory
>>>>> allocating will fail in __vm_enough_memory().
>>>>>
>>>>> Related discuss:
>>>>> http://lkml.org/lkml/2012/11/5/866
>>>>> https://patchwork.kernel.org/patch/1346751/
>>>>>
>>>>> Related patches in mmotm:
>>>>> mm: fix-up zone present pages(7f1290f2f2a4d2c) (sometimes cause
>>>>> egression)
>>>>> mm: fix a regression with HIGHMEM(fe2cebd5a259eec) (Andrew have some
>>>>> feedback)
>>>>>
>>>>> Jiang Liu have sent a series patches to fix this issue by adding a
>>>>> managed_pages area to zone struct:
>>>>> [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
>>>>>
>>>>> But i think it's too complicated.
>>>>> Mine is based on the two related patches already in mmotm(need to
>>>>> revert them
>>>>> first)
>>>>> It fix the calculation of zone->present_pages by:
>>>>> 1. Reset the zone->present_pages to zero before
>>>>> free_all_bootmem(),free_all_bootmem_node() and
>>>>> free_low_memory_core_early().
>>>>> I think these should already included all path in all arch.
>>>>>
>>>>> 2. If there is a page freed to buddy system in __free_pages_bootmem(),
>>>>> add zone->present_pages accrodingly.
>>>>>
>>>>> Note this patch assumes that bootmem won't use memory above
>>>>> ZONE_HIGHMEM, so
>>>> Hmm, on x86_64 box, bootmem uses the memory in ZONE_MOVABLE and
>>>> ZONE_MOVABLE > ZONE_HIGHMEM.
>>> That's an issue, I'm trying to avoid allocating bootmem from
>>> ZONE_MOVABLE.
>>> And is_highmem(z) or is_highmem_idx(zoneidx) could be safely used to
>>> distinguish
>>> movable highmem zones.
>> Hi Jiang,
>>
>> - I'm not sure why the above mentioned bootmem use the memory in
>> ZONE_MOVABLE, IIUR, because current nobootmem/memblock logic will alloc
>> pages from highest available memory to lowest available memory to avoid
>> fragmentation. Is it correct?
>> - why need avoid allocating bootmem from ZONE_MOVABLE?
> ZONE_MOVABLE means that, the memory can be moved to the other place. So
> we can offline it. It is very useful for memory hotplug.

Hi Congyang,

Thanks for your quick response. Then how do we distinguish, during
memory offline, whether the memory block we are about to remove was
allocated from bootmem rather than from the buddy system?

Regards,
Jaegeuk

>
> Thanks
> Wen Congyang
>
>> Regards,
>> Jaejeuk
>>>>> only zones below ZONE_HIGHMEM are reset/fixed. If not, some update
>>>>> is needed.
>>>>> For ZONE_HIGHMEM, only fix it's init value to:
>>>>> panned_pages - absent_pages in free_area_init_core().
>>>>>
>>>>> Only did some simple test currently.
>>>>>
>>>>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
>>>>> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
>>>>> Signed-off-by: Bob Liu <lliubbo@gmail.com>
>>>>> ---
>>>>>    include/linux/mm.h |    3 +++
>>>>>    mm/bootmem.c       |    2 ++
>>>>>    mm/nobootmem.c     |    1 +
>>>>>    mm/page_alloc.c    |   49
>>>>> +++++++++++++++++++++++++++++++++++++------------
>>>>>    4 files changed, 43 insertions(+), 12 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>> index 7b03cab..3b40eb6 100644
>>>>> --- a/include/linux/mm.h
>>>>> +++ b/include/linux/mm.h
>>>>> @@ -1763,5 +1763,8 @@ static inline unsigned int
>>>>> debug_guardpage_minorder(void) { return 0; }
>>>>>    static inline bool page_is_guard(struct page *page) { return false; }
>>>>>    #endif /* CONFIG_DEBUG_PAGEALLOC */
>>>>>    +extern void reset_lowmem_zone_present_pages_pernode(pg_data_t
>>>>> *pgdat);
>>>>> +extern void reset_lowmem_zone_present_pages(void);
>>>>> +
>>>>>    #endif /* __KERNEL__ */
>>>>>    #endif /* _LINUX_MM_H */
>>>>> diff --git a/mm/bootmem.c b/mm/bootmem.c
>>>>> index 26d057a..661775b 100644
>>>>> --- a/mm/bootmem.c
>>>>> +++ b/mm/bootmem.c
>>>>> @@ -238,6 +238,7 @@ static unsigned long __init
>>>>> free_all_bootmem_core(bootmem_data_t *bdata)
>>>>>    unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
>>>>>    {
>>>>>        register_page_bootmem_info_node(pgdat);
>>>>> +    reset_lowmem_zone_present_pages_pernode(pgdat);
>>>>>        return free_all_bootmem_core(pgdat->bdata);
>>>>>    }
>>>>>    @@ -251,6 +252,7 @@ unsigned long __init free_all_bootmem(void)
>>>>>        unsigned long total_pages = 0;
>>>>>        bootmem_data_t *bdata;
>>>>>    +    reset_lowmem_zone_present_pages();
>>>>>        list_for_each_entry(bdata, &bdata_list, list)
>>>>>            total_pages += free_all_bootmem_core(bdata);
>>>>>    diff --git a/mm/nobootmem.c b/mm/nobootmem.c
>>>>> index bd82f6b..378d50a 100644
>>>>> --- a/mm/nobootmem.c
>>>>> +++ b/mm/nobootmem.c
>>>>> @@ -126,6 +126,7 @@ unsigned long __init
>>>>> free_low_memory_core_early(int nodeid)
>>>>>        phys_addr_t start, end, size;
>>>>>        u64 i;
>>>>>    +    reset_lowmem_zone_present_pages();
>>>>>        for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
>>>>>            count += __free_memory_core(start, end);
>>>>>    diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 07425a7..76d37f0 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -735,6 +735,7 @@ void __meminit __free_pages_bootmem(struct page
>>>>> *page, unsigned int order)
>>>>>    {
>>>>>        unsigned int nr_pages = 1 << order;
>>>>>        unsigned int loop;
>>>>> +    struct zone *zone;
>>>>>          prefetchw(page);
>>>>>        for (loop = 0; loop < nr_pages; loop++) {
>>>>> @@ -748,6 +749,9 @@ void __meminit __free_pages_bootmem(struct page
>>>>> *page, unsigned int order)
>>>>>          set_page_refcounted(page);
>>>>>        __free_pages(page, order);
>>>>> +    zone = page_zone(page);
>>>>> +    WARN_ON(!(is_normal(zone) || is_dma(zone) || is_dma32(zone)));
>>>>> +    zone->present_pages += nr_pages;
>>>>>    }
>>>>>      #ifdef CONFIG_CMA
>>>>> @@ -4547,18 +4551,20 @@ static void __paginginit
>>>>> free_area_init_core(struct pglist_data *pgdat,
>>>>>             * is used by this zone for memmap. This affects the
>>>>> watermark
>>>>>             * and per-cpu initialisations
>>>>>             */
>>>>> -        memmap_pages =
>>>>> -            PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>>>>> -        if (realsize >= memmap_pages) {
>>>>> -            realsize -= memmap_pages;
>>>>> -            if (memmap_pages)
>>>>> -                printk(KERN_DEBUG
>>>>> -                       "  %s zone: %lu pages used for memmap\n",
>>>>> -                       zone_names[j], memmap_pages);
>>>>> -        } else
>>>>> -            printk(KERN_WARNING
>>>>> -                "  %s zone: %lu pages exceeds realsize %lu\n",
>>>>> -                zone_names[j], memmap_pages, realsize);
>>>>> +        if (j < ZONE_HIGHMEM) {
>>>>> +            memmap_pages =
>>>>> +                PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>>>>> +            if (realsize >= memmap_pages) {
>>>>> +                realsize -= memmap_pages;
>>>>> +                if (memmap_pages)
>>>>> +                    printk(KERN_DEBUG
>>>>> +                            "  %s zone: %lu pages used for memmap\n",
>>>>> +                            zone_names[j], memmap_pages);
>>>>> +            } else
>>>>> +                printk(KERN_WARNING
>>>>> +                        "  %s zone: %lu pages exceeds realsize %lu\n",
>>>>> +                        zone_names[j], memmap_pages, realsize);
>>>>> +        }
>>>>>              /* Account for reserved pages */
>>>>>            if (j == 0 && realsize > dma_reserve) {
>>>>> @@ -6143,3 +6149,22 @@ void dump_page(struct page *page)
>>>>>        dump_page_flags(page->flags);
>>>>>        mem_cgroup_print_bad_page(page);
>>>>>    }
>>>>> +
>>>>> +/* reset zone->present_pages to 0 for zones below ZONE_HIGHMEM */
>>>>> +void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
>>>>> +{
>>>>> +    int i;
>>>>> +    struct zone *z;
>>>>> +    for (i = 0; i < ZONE_HIGHMEM; i++) {
>>>> And if CONFIG_HIGHMEM is not set, ZONE_NORMAL == ZONE_HIGHMEM.
>>>>
>>>> So you don't reset ZONE_NORMAL here.
>>>>
>>>>> +        z = pgdat->node_zones + i;
>>>>> +        z->present_pages = 0;
>>>>> +    }
>>>>> +}
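
A minimal sketch of the adjustment that comment suggests (not part of the
posted patch): bound the loop by ZONE_NORMAL instead, since the lowmem zones
(DMA, DMA32, NORMAL) are exactly the zones with an index up to and including
ZONE_NORMAL, regardless of how CONFIG_HIGHMEM is set. Like the original, this
would live in mm/page_alloc.c:

void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
{
	int i;
	struct zone *z;

	/* covers DMA, DMA32 and NORMAL even when CONFIG_HIGHMEM is off */
	for (i = 0; i <= ZONE_NORMAL; i++) {
		z = pgdat->node_zones + i;
		z->present_pages = 0;
	}
}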
>>>>> +
>>>>> +void reset_lowmem_zone_present_pages(void)
>>>>> +{
>>>>> +    int nid;
>>>>> +
>>>>> +    for_each_node_state(nid, N_HIGH_MEMORY)
>>>>> +        reset_lowmem_zone_present_pages_pernode(NODE_DATA(nid));
>>>>> +}
>>>> .
>>>>
>>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] mm: fix up zone's present_pages
  2012-11-19  9:07 ` Jiang Liu
@ 2012-11-20  7:44   ` Bob Liu
  0 siblings, 0 replies; 8+ messages in thread
From: Bob Liu @ 2012-11-20  7:44 UTC (permalink / raw)
  To: Jiang Liu
  Cc: akpm, maciej.rutecki, chris2553, rjw, mgorman, minchan,
	kamezawa.hiroyu, mhocko, wency, daniel.vetter, rientjes,
	wujianguo, ptesarik, riel, linux-mm

On Mon, Nov 19, 2012 at 5:07 PM, Jiang Liu <jiang.liu@huawei.com> wrote:
> On 2012-11-19 16:45, Bob Liu wrote:
>> zone->present_pages shoule be:
>> spanned pages - absent pages - bootmem pages(including memmap pages),
>> but now it's:
>> spanned pages - absent pages - memmap pages.
>> And it didn't consider whether the memmap pages is actully allocated from the
>> zone or not which may cause problem when memory hotplug is improved recently.
>>
>> For example:
>> numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem
>> allocated from ZONE_MOVABLE.
>> So ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now
>> it also minus memmap pages, which are actually allocated from ZONE_MOVABLE.
>> This is wrong and when offlining all memory of this zone:
>> (zone->present_pages -= offline_pages) will less than 0.
>> Since present_pages is unsigned long type, that is actually a very large
>> integer which will cause zone->watermark[WMARK_MIN] becomes a large
>> integer too(see setup_per_zone_wmarks()).
>> As a result, totalreserve_pages become a large integer also and finally memory
>> allocating will fail in __vm_enough_memory().
>>
>> Related discuss:
>> http://lkml.org/lkml/2012/11/5/866
>> https://patchwork.kernel.org/patch/1346751/
>>
>> Related patches in mmotm:
>> mm: fix-up zone present pages(7f1290f2f2a4d2c) (sometimes cause egression)
>> mm: fix a regression with HIGHMEM(fe2cebd5a259eec) (Andrew have some feedback)
>>
>> Jiang Liu have sent a series patches to fix this issue by adding a
>> managed_pages area to zone struct:
>> [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
>>
>> But i think it's too complicated.
>> Mine is based on the two related patches already in mmotm(need to revert them
>> first)
>> It fix the calculation of zone->present_pages by:
>> 1. Reset the zone->present_pages to zero before
>> free_all_bootmem(),free_all_bootmem_node() and free_low_memory_core_early().
>> I think these should already included all path in all arch.
>>
>> 2. If there is a page freed to buddy system in __free_pages_bootmem(),
>> add zone->present_pages accrodingly.
>>
>> Note this patch assumes that bootmem won't use memory above ZONE_HIGHMEM, so
>> only zones below ZONE_HIGHMEM are reset/fixed. If not, some update is needed.
>> For ZONE_HIGHMEM, only fix it's init value to:
>> panned_pages - absent_pages in free_area_init_core().
>>
>> Only did some simple test currently.
> Hi Bob,
>         Great to know that you are working on this issue too.
> Originally I thought about reusing zone->present_pages.
> Later I proposed adding the new field "managed_pages", because then
> (present_pages - managed_pages) tells us how many pages in the zone
> are not managed by the buddy system. This may help the ongoing memory
> power management work from Srivatsa S. Bhat, because we can't put
> memory ranges into a low power state if there are unmanaged pages.
> Also, pgdat->node_present_pages = spanned_pages - absent_pages, so it
> would be better to stay consistent with node_present_pages by setting
> zone->present_pages = spanned_pages - absent_pages.
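
For reference, a hedged sketch of the accounting argued for above. Only the
field names come from the RFT series; this trimmed-down structure and its
comments are an illustration, not the kernel's struct zone:

/* Hypothetical summary of the proposed per-zone page accounting. */
struct zone_accounting_sketch {
	unsigned long spanned_pages;	/* zone_end_pfn - zone_start_pfn */
	unsigned long present_pages;	/* spanned_pages - absent_pages (holes) */
	unsigned long managed_pages;	/* present_pages minus pages kept by
					 * bootmem/memblock, i.e. pages actually
					 * managed by the buddy allocator */
};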

Okay, and I have seen Andrew's feedback on your series.
I will spend more time on it when I am free.
Sorry for the noise.

>
>>
>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
>> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
>> Signed-off-by: Bob Liu <lliubbo@gmail.com>
>> ---
>>  include/linux/mm.h |    3 +++
>>  mm/bootmem.c       |    2 ++
>>  mm/nobootmem.c     |    1 +
>>  mm/page_alloc.c    |   49 +++++++++++++++++++++++++++++++++++++------------
>>  4 files changed, 43 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 7b03cab..3b40eb6 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1763,5 +1763,8 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
>>  static inline bool page_is_guard(struct page *page) { return false; }
>>  #endif /* CONFIG_DEBUG_PAGEALLOC */
>>
>> +extern void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat);
>> +extern void reset_lowmem_zone_present_pages(void);
>> +
>>  #endif /* __KERNEL__ */
>>  #endif /* _LINUX_MM_H */
>> diff --git a/mm/bootmem.c b/mm/bootmem.c
>> index 26d057a..661775b 100644
>> --- a/mm/bootmem.c
>> +++ b/mm/bootmem.c
>> @@ -238,6 +238,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
>>  unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
>>  {
>>       register_page_bootmem_info_node(pgdat);
>> +     reset_lowmem_zone_present_pages_pernode(pgdat);
>>       return free_all_bootmem_core(pgdat->bdata);
>>  }
>>
>> @@ -251,6 +252,7 @@ unsigned long __init free_all_bootmem(void)
>>       unsigned long total_pages = 0;
>>       bootmem_data_t *bdata;
>>
>> +     reset_lowmem_zone_present_pages();
>>       list_for_each_entry(bdata, &bdata_list, list)
>>               total_pages += free_all_bootmem_core(bdata);
>>
>> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
>> index bd82f6b..378d50a 100644
>> --- a/mm/nobootmem.c
>> +++ b/mm/nobootmem.c
>> @@ -126,6 +126,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
>>       phys_addr_t start, end, size;
>>       u64 i;
>>
>> +     reset_lowmem_zone_present_pages();
>>       for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
>>               count += __free_memory_core(start, end);
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 07425a7..76d37f0 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -735,6 +735,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>>  {
>>       unsigned int nr_pages = 1 << order;
>>       unsigned int loop;
>> +     struct zone *zone;
>>
>>       prefetchw(page);
>>       for (loop = 0; loop < nr_pages; loop++) {
>> @@ -748,6 +749,9 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>>
>>       set_page_refcounted(page);
>>       __free_pages(page, order);
>> +     zone = page_zone(page);
>> +     WARN_ON(!(is_normal(zone) || is_dma(zone) || is_dma32(zone)));
>> +     zone->present_pages += nr_pages;
>>  }
>>
>>  #ifdef CONFIG_CMA
>> @@ -4547,18 +4551,20 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>>                * is used by this zone for memmap. This affects the watermark
>>                * and per-cpu initialisations
>>                */
>> -             memmap_pages =
>> -                     PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>> -             if (realsize >= memmap_pages) {
>> -                     realsize -= memmap_pages;
>> -                     if (memmap_pages)
>> -                             printk(KERN_DEBUG
>> -                                    "  %s zone: %lu pages used for memmap\n",
>> -                                    zone_names[j], memmap_pages);
>> -             } else
>> -                     printk(KERN_WARNING
>> -                             "  %s zone: %lu pages exceeds realsize %lu\n",
>> -                             zone_names[j], memmap_pages, realsize);
>> +             if (j < ZONE_HIGHMEM) {
>> +                     memmap_pages =
>> +                             PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>> +                     if (realsize >= memmap_pages) {
>> +                             realsize -= memmap_pages;
>> +                             if (memmap_pages)
>> +                                     printk(KERN_DEBUG
>> +                                                     "  %s zone: %lu pages used for memmap\n",
>> +                                                     zone_names[j], memmap_pages);
>> +                     } else
>> +                             printk(KERN_WARNING
>> +                                             "  %s zone: %lu pages exceeds realsize %lu\n",
>> +                                             zone_names[j], memmap_pages, realsize);
>> +             }
>>
>>               /* Account for reserved pages */
>>               if (j == 0 && realsize > dma_reserve) {
>> @@ -6143,3 +6149,22 @@ void dump_page(struct page *page)
>>       dump_page_flags(page->flags);
>>       mem_cgroup_print_bad_page(page);
>>  }
>> +
>> +/* reset zone->present_pages to 0 for zones below ZONE_HIGHMEM */
>> +void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
>> +{
>> +     int i;
>> +     struct zone *z;
>> +     for (i = 0; i < ZONE_HIGHMEM; i++) {
>> +             z = pgdat->node_zones + i;
>> +             z->present_pages = 0;
>> +     }
>> +}
>> +
>> +void reset_lowmem_zone_present_pages(void)
>> +{
>> +     int nid;
>> +
>> +     for_each_node_state(nid, N_HIGH_MEMORY)
>> +             reset_lowmem_zone_present_pages_pernode(NODE_DATA(nid));
>> +}
>
>

-- 
Regards,
--Bob


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2012-11-20  7:44 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-19  8:45 [RFC PATCH] mm: fix up zone's present_pages Bob Liu
2012-11-19  9:07 ` Jiang Liu
2012-11-20  7:44   ` Bob Liu
2012-11-19  9:12 ` Wen Congyang
2012-11-19  9:11   ` Jiang Liu
2012-11-20  3:47     ` Jaegeuk Hanse
2012-11-20  3:59       ` Wen Congyang
2012-11-20  4:11         ` Jaegeuk Hanse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).