From: Jiang Liu <jiang.liu@huawei.com>
To: Bob Liu <lliubbo@gmail.com>
Cc: akpm@linux-foundation.org, maciej.rutecki@gmail.com,
chris2553@googlemail.com, rjw@sisk.pl, mgorman@suse.de,
minchan@kernel.org, kamezawa.hiroyu@jp.fujitsu.com,
mhocko@suse.cz, wency@cn.fujitsu.com, daniel.vetter@ffwll.ch,
rientjes@google.com, wujianguo@huawei.com, ptesarik@suse.cz,
riel@redhat.com, linux-mm@kvack.org
Subject: Re: [RFC PATCH] mm: fix up zone's present_pages
Date: Mon, 19 Nov 2012 17:07:40 +0800 [thread overview]
Message-ID: <50A9F6DC.6050408@huawei.com> (raw)
In-Reply-To: <1353314707-31834-1-git-send-email-lliubbo@gmail.com>
On 2012-11-19 16:45, Bob Liu wrote:
> zone->present_pages shoule be:
> spanned pages - absent pages - bootmem pages(including memmap pages),
> but now it's:
> spanned pages - absent pages - memmap pages.
> And it didn't consider whether the memmap pages is actully allocated from the
> zone or not which may cause problem when memory hotplug is improved recently.
>
> For example:
> numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem
> allocated from ZONE_MOVABLE.
> So ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now
> it also minus memmap pages, which are actually allocated from ZONE_MOVABLE.
> This is wrong and when offlining all memory of this zone:
> (zone->present_pages -= offline_pages) will less than 0.
> Since present_pages is unsigned long type, that is actually a very large
> integer which will cause zone->watermark[WMARK_MIN] becomes a large
> integer too(see setup_per_zone_wmarks()).
> As a result, totalreserve_pages become a large integer also and finally memory
> allocating will fail in __vm_enough_memory().
>
> Related discuss:
> http://lkml.org/lkml/2012/11/5/866
> https://patchwork.kernel.org/patch/1346751/
>
> Related patches in mmotm:
> mm: fix-up zone present pages(7f1290f2f2a4d2c) (sometimes cause egression)
> mm: fix a regression with HIGHMEM(fe2cebd5a259eec) (Andrew have some feedback)
>
> Jiang Liu have sent a series patches to fix this issue by adding a
> managed_pages area to zone struct:
> [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
>
> But i think it's too complicated.
> Mine is based on the two related patches already in mmotm(need to revert them
> first)
> It fix the calculation of zone->present_pages by:
> 1. Reset the zone->present_pages to zero before
> free_all_bootmem(),free_all_bootmem_node() and free_low_memory_core_early().
> I think these should already included all path in all arch.
>
> 2. If there is a page freed to buddy system in __free_pages_bootmem(),
> add zone->present_pages accrodingly.
>
> Note this patch assumes that bootmem won't use memory above ZONE_HIGHMEM, so
> only zones below ZONE_HIGHMEM are reset/fixed. If not, some update is needed.
> For ZONE_HIGHMEM, only fix it's init value to:
> panned_pages - absent_pages in free_area_init_core().
>
> Only did some simple test currently.
Hi Bobi 1/4 ?
Great to know that you are working on this issue too.
Originally I have thought about reusing the zone->present_pages.
And later I propose to add the new field "managed_pages" because
we could know that there are some pages not managed by the buddy
system in the zone (present_pages - managed_pages). This may help
the ongoing memory power management work from Srivatsa S. Bhat
because we can't put memory ranges into low power state if there
are unmanaged pages.
And pgdat->node_present_pages = spanned_pages - absent_pages,
so it would be better to keep consistence with node_present_pages
by setting zone->present_pages = spanned_pages - absent_pages.
Thanks!
Gerry
>
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
> Signed-off-by: Bob Liu <lliubbo@gmail.com>
> ---
> include/linux/mm.h | 3 +++
> mm/bootmem.c | 2 ++
> mm/nobootmem.c | 1 +
> mm/page_alloc.c | 49 +++++++++++++++++++++++++++++++++++++------------
> 4 files changed, 43 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7b03cab..3b40eb6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1763,5 +1763,8 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
> static inline bool page_is_guard(struct page *page) { return false; }
> #endif /* CONFIG_DEBUG_PAGEALLOC */
>
> +extern void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat);
> +extern void reset_lowmem_zone_present_pages(void);
> +
> #endif /* __KERNEL__ */
> #endif /* _LINUX_MM_H */
> diff --git a/mm/bootmem.c b/mm/bootmem.c
> index 26d057a..661775b 100644
> --- a/mm/bootmem.c
> +++ b/mm/bootmem.c
> @@ -238,6 +238,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
> unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
> {
> register_page_bootmem_info_node(pgdat);
> + reset_lowmem_zone_present_pages_pernode(pgdat);
> return free_all_bootmem_core(pgdat->bdata);
> }
>
> @@ -251,6 +252,7 @@ unsigned long __init free_all_bootmem(void)
> unsigned long total_pages = 0;
> bootmem_data_t *bdata;
>
> + reset_lowmem_zone_present_pages();
> list_for_each_entry(bdata, &bdata_list, list)
> total_pages += free_all_bootmem_core(bdata);
>
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index bd82f6b..378d50a 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -126,6 +126,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
> phys_addr_t start, end, size;
> u64 i;
>
> + reset_lowmem_zone_present_pages();
> for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
> count += __free_memory_core(start, end);
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 07425a7..76d37f0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -735,6 +735,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
> {
> unsigned int nr_pages = 1 << order;
> unsigned int loop;
> + struct zone *zone;
>
> prefetchw(page);
> for (loop = 0; loop < nr_pages; loop++) {
> @@ -748,6 +749,9 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>
> set_page_refcounted(page);
> __free_pages(page, order);
> + zone = page_zone(page);
> + WARN_ON(!(is_normal(zone) || is_dma(zone) || is_dma32(zone)));
> + zone->present_pages += nr_pages;
> }
>
> #ifdef CONFIG_CMA
> @@ -4547,18 +4551,20 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
> * is used by this zone for memmap. This affects the watermark
> * and per-cpu initialisations
> */
> - memmap_pages =
> - PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
> - if (realsize >= memmap_pages) {
> - realsize -= memmap_pages;
> - if (memmap_pages)
> - printk(KERN_DEBUG
> - " %s zone: %lu pages used for memmap\n",
> - zone_names[j], memmap_pages);
> - } else
> - printk(KERN_WARNING
> - " %s zone: %lu pages exceeds realsize %lu\n",
> - zone_names[j], memmap_pages, realsize);
> + if (j < ZONE_HIGHMEM) {
> + memmap_pages =
> + PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
> + if (realsize >= memmap_pages) {
> + realsize -= memmap_pages;
> + if (memmap_pages)
> + printk(KERN_DEBUG
> + " %s zone: %lu pages used for memmap\n",
> + zone_names[j], memmap_pages);
> + } else
> + printk(KERN_WARNING
> + " %s zone: %lu pages exceeds realsize %lu\n",
> + zone_names[j], memmap_pages, realsize);
> + }
>
> /* Account for reserved pages */
> if (j == 0 && realsize > dma_reserve) {
> @@ -6143,3 +6149,22 @@ void dump_page(struct page *page)
> dump_page_flags(page->flags);
> mem_cgroup_print_bad_page(page);
> }
> +
> +/* reset zone->present_pages to 0 for zones below ZONE_HIGHMEM */
> +void reset_lowmem_zone_present_pages_pernode(pg_data_t *pgdat)
> +{
> + int i;
> + struct zone *z;
> + for (i = 0; i < ZONE_HIGHMEM; i++) {
> + z = pgdat->node_zones + i;
> + z->present_pages = 0;
> + }
> +}
> +
> +void reset_lowmem_zone_present_pages(void)
> +{
> + int nid;
> +
> + for_each_node_state(nid, N_HIGH_MEMORY)
> + reset_lowmem_zone_present_pages_pernode(NODE_DATA(nid));
> +}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2012-11-19 9:08 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-11-19 8:45 [RFC PATCH] mm: fix up zone's present_pages Bob Liu
2012-11-19 9:07 ` Jiang Liu [this message]
2012-11-20 7:44 ` Bob Liu
2012-11-19 9:12 ` Wen Congyang
2012-11-19 9:11 ` Jiang Liu
2012-11-20 3:47 ` Jaegeuk Hanse
2012-11-20 3:59 ` Wen Congyang
2012-11-20 4:11 ` Jaegeuk Hanse
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=50A9F6DC.6050408@huawei.com \
--to=jiang.liu@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=chris2553@googlemail.com \
--cc=daniel.vetter@ffwll.ch \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-mm@kvack.org \
--cc=lliubbo@gmail.com \
--cc=maciej.rutecki@gmail.com \
--cc=mgorman@suse.de \
--cc=mhocko@suse.cz \
--cc=minchan@kernel.org \
--cc=ptesarik@suse.cz \
--cc=riel@redhat.com \
--cc=rientjes@google.com \
--cc=rjw@sisk.pl \
--cc=wency@cn.fujitsu.com \
--cc=wujianguo@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).