linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
To: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>, Bob Liu <lliubbo@gmail.com>,
	akpm@linux-foundation.org, maciej.rutecki@gmail.com,
	chris2553@googlemail.com, rjw@sisk.pl, mgorman@suse.de,
	minchan@kernel.org, kamezawa.hiroyu@jp.fujitsu.com,
	mhocko@suse.cz, daniel.vetter@ffwll.ch, rientjes@google.com,
	wujianguo@huawei.com, ptesarik@suse.cz, riel@redhat.com,
	linux-mm@kvack.org, lai jiangshan <laijs@cn.fujitsu.com>
Subject: Re: [RFC PATCH] mm: fix up zone's present_pages
Date: Tue, 20 Nov 2012 11:47:53 +0800	[thread overview]
Message-ID: <50AAFD69.7080507@gmail.com> (raw)
In-Reply-To: <50A9F7BE.1050808@huawei.com>

On 11/19/2012 05:11 PM, Jiang Liu wrote:
> On 2012-11-19 17:12, Wen Congyang wrote:
>> At 11/19/2012 04:45 PM, Bob Liu Wrote:
>>> zone->present_pages shoule be:
>>> spanned pages - absent pages - bootmem pages(including memmap pages),
>>> but now it's:
>>> spanned pages - absent pages - memmap pages.
>>> And it didn't consider whether the memmap pages is actully allocated from the
>>> zone or not which may cause problem when memory hotplug is improved recently.
>>>
>>> For example:
>>> numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem
>>> allocated from ZONE_MOVABLE.
>>> So ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now
>>> it also minus memmap pages, which are actually allocated from ZONE_MOVABLE.
>>> This is wrong and when offlining all memory of this zone:
>>> (zone->present_pages -= offline_pages) will less than 0.
>>> Since present_pages is unsigned long type, that is actually a very large
>>> integer which will cause zone->watermark[WMARK_MIN] becomes a large
>>> integer too(see setup_per_zone_wmarks()).
>>> As a result, totalreserve_pages become a large integer also and finally memory
>>> allocating will fail in __vm_enough_memory().
>>>
>>> Related discuss:
>>> http://lkml.org/lkml/2012/11/5/866
>>> https://patchwork.kernel.org/patch/1346751/
>>>
>>> Related patches in mmotm:
>>> mm: fix-up zone present pages(7f1290f2f2a4d2c) (sometimes cause egression)
>>> mm: fix a regression with HIGHMEM(fe2cebd5a259eec) (Andrew have some feedback)
>>>
>>> Jiang Liu have sent a series patches to fix this issue by adding a
>>> managed_pages area to zone struct:
>>> [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
>>>
>>> But i think it's too complicated.
>>> Mine is based on the two related patches already in mmotm(need to revert them
>>> first)
>>> It fix the calculation of zone->present_pages by:
>>> 1. Reset the zone->present_pages to zero before
>>> free_all_bootmem(),free_all_bootmem_node() and free_low_memory_core_early().
>>> I think these should already included all path in all arch.
>>>
>>> 2. If there is a page freed to buddy system in __free_pages_bootmem(),
>>> add zone->present_pages accrodingly.
>>>
>>> Note this patch assumes that bootmem won't use memory above ZONE_HIGHMEM, so
>> Hmm, on x86_64 box, bootmem uses the memory in ZONE_MOVABLE and
>> ZONE_MOVABLE > ZONE_HIGHMEM.
> That's an issue, I'm trying to avoid allocating bootmem from ZONE_MOVABLE.
> And is_highmem(z) or is_highmem_idx(zoneidx) could be safely used to distinguish
> movable highmem zones.
Hi Jiang,

- I'm not sure why the above mentioned bootmem use the memory in 
ZONE_MOVABLE, IIUR, because current nobootmem/memblock logic will alloc 
pages from highest available memory to lowest available memory to avoid 
fragmentation. Is it correct?
- why need avoid allocating bootmem from ZONE_MOVABLE?

Regards,
Jaejeuk
>
>>> only zones below ZONE_HIGHMEM are reset/fixed. If not, some update is needed.
>>> For ZONE_HIGHMEM, only fix it's init value to:
>>> panned_pages - absent_pages in free_area_init_core().
>>>
>>> Only did some simple test currently.
>>>
>>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
>>> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
>>> Signed-off-by: Bob Liu <lliubbo@gmail.com>
>>> ---
>>>   include/linux/mm.h |    3 +++
>>>   mm/bootmem.c       |    2 ++
>>>   mm/nobootmem.c     |    1 +
>>>   mm/page_alloc.c    |   49 +++++++++++++++++++++++++++++++++++++------------
>>>   4 files changed, 43 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 7b03cab..3b40eb6 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1763,5 +1763,8 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
>>>   static inline bool page_is_guard(struct page *page) { return false; }
>>>   #endif /* CONFIG_DEBUG_PAGEALLOC */
>>>   
>>> +extern void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat);
>>> +extern void reset_lowmem_zone_present_pages(void);
>>> +
>>>   #endif /* __KERNEL__ */
>>>   #endif /* _LINUX_MM_H */
>>> diff --git a/mm/bootmem.c b/mm/bootmem.c
>>> index 26d057a..661775b 100644
>>> --- a/mm/bootmem.c
>>> +++ b/mm/bootmem.c
>>> @@ -238,6 +238,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
>>>   unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
>>>   {
>>>   	register_page_bootmem_info_node(pgdat);
>>> +	reset_lowmem_zone_present_pages_pernode(pgdat);
>>>   	return free_all_bootmem_core(pgdat->bdata);
>>>   }
>>>   
>>> @@ -251,6 +252,7 @@ unsigned long __init free_all_bootmem(void)
>>>   	unsigned long total_pages = 0;
>>>   	bootmem_data_t *bdata;
>>>   
>>> +	reset_lowmem_zone_present_pages();
>>>   	list_for_each_entry(bdata, &bdata_list, list)
>>>   		total_pages += free_all_bootmem_core(bdata);
>>>   
>>> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
>>> index bd82f6b..378d50a 100644
>>> --- a/mm/nobootmem.c
>>> +++ b/mm/nobootmem.c
>>> @@ -126,6 +126,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
>>>   	phys_addr_t start, end, size;
>>>   	u64 i;
>>>   
>>> +	reset_lowmem_zone_present_pages();
>>>   	for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
>>>   		count += __free_memory_core(start, end);
>>>   
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 07425a7..76d37f0 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -735,6 +735,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>>>   {
>>>   	unsigned int nr_pages = 1 << order;
>>>   	unsigned int loop;
>>> +	struct zone *zone;
>>>   
>>>   	prefetchw(page);
>>>   	for (loop = 0; loop < nr_pages; loop++) {
>>> @@ -748,6 +749,9 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>>>   
>>>   	set_page_refcounted(page);
>>>   	__free_pages(page, order);
>>> +	zone = page_zone(page);
>>> +	WARN_ON(!(is_normal(zone) || is_dma(zone) || is_dma32(zone)));
>>> +	zone->present_pages += nr_pages;
>>>   }
>>>   
>>>   #ifdef CONFIG_CMA
>>> @@ -4547,18 +4551,20 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>>>   		 * is used by this zone for memmap. This affects the watermark
>>>   		 * and per-cpu initialisations
>>>   		 */
>>> -		memmap_pages =
>>> -			PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>>> -		if (realsize >= memmap_pages) {
>>> -			realsize -= memmap_pages;
>>> -			if (memmap_pages)
>>> -				printk(KERN_DEBUG
>>> -				       "  %s zone: %lu pages used for memmap\n",
>>> -				       zone_names[j], memmap_pages);
>>> -		} else
>>> -			printk(KERN_WARNING
>>> -				"  %s zone: %lu pages exceeds realsize %lu\n",
>>> -				zone_names[j], memmap_pages, realsize);
>>> +		if (j < ZONE_HIGHMEM) {
>>> +			memmap_pages =
>>> +				PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
>>> +			if (realsize >= memmap_pages) {
>>> +				realsize -= memmap_pages;
>>> +				if (memmap_pages)
>>> +					printk(KERN_DEBUG
>>> +							"  %s zone: %lu pages used for memmap\n",
>>> +							zone_names[j], memmap_pages);
>>> +			} else
>>> +				printk(KERN_WARNING
>>> +						"  %s zone: %lu pages exceeds realsize %lu\n",
>>> +						zone_names[j], memmap_pages, realsize);
>>> +		}
>>>   
>>>   		/* Account for reserved pages */
>>>   		if (j == 0 && realsize > dma_reserve) {
>>> @@ -6143,3 +6149,22 @@ void dump_page(struct page *page)
>>>   	dump_page_flags(page->flags);
>>>   	mem_cgroup_print_bad_page(page);
>>>   }
>>> +
>>> +/* reset zone->present_pages to 0 for zones below ZONE_HIGHMEM */
>>> +void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
>>> +{
>>> +	int i;
>>> +	struct zone *z;
>>> +	for (i = 0; i < ZONE_HIGHMEM; i++) {
>> And if CONFIG_HIGHMEM is no, ZONE_NORMAL == ZONE_HIGHMEM.
>>
>> So, you don't reset ZONE_NORMAL here.
>>
>>> +		z = pgdat->node_zones + i;
>>> +		z->present_pages = 0;
>>> +	}
>>> +}
>>> +
>>> +void reset_lowmem_zone_present_pages(void)
>>> +{
>>> +	int nid;
>>> +
>>> +	for_each_node_state(nid, N_HIGH_MEMORY)
>>> +		reset_lowmem_zone_present_pages_pernode(NODE_DATA(nid));
>>> +}
>>
>> .
>>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2012-11-20  3:48 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-11-19  8:45 [RFC PATCH] mm: fix up zone's present_pages Bob Liu
2012-11-19  9:07 ` Jiang Liu
2012-11-20  7:44   ` Bob Liu
2012-11-19  9:12 ` Wen Congyang
2012-11-19  9:11   ` Jiang Liu
2012-11-20  3:47     ` Jaegeuk Hanse [this message]
2012-11-20  3:59       ` Wen Congyang
2012-11-20  4:11         ` Jaegeuk Hanse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50AAFD69.7080507@gmail.com \
    --to=jaegeuk.hanse@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=chris2553@googlemail.com \
    --cc=daniel.vetter@ffwll.ch \
    --cc=jiang.liu@huawei.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=lliubbo@gmail.com \
    --cc=maciej.rutecki@gmail.com \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=minchan@kernel.org \
    --cc=ptesarik@suse.cz \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    --cc=rjw@sisk.pl \
    --cc=wency@cn.fujitsu.com \
    --cc=wujianguo@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).