From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx167.postini.com [74.125.245.167])
	by kanga.kvack.org (Postfix) with SMTP id 7C63D6B0032
	for ; Thu, 11 Jul 2013 22:04:04 -0400 (EDT)
From: Robin Holt
Subject: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
Date: Thu, 11 Jul 2013 21:03:51 -0500
Message-Id: <1373594635-131067-1-git-send-email-holt@sgi.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: "H. Peter Anvin", Ingo Molnar
Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
	Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
	Mel Gorman

We have been working on this since we returned from shutdown and have
something to discuss now. We restricted ourselves to 2MiB initialization
to keep the patch set a little smaller and more clear.

First, I think I want to propose getting rid of the page flag. If I knew
of a concrete way to determine that the page has not been initialized,
this patch series would look different. If there is no definitive
way to determine that the struct page has been initialized aside from
checking the entire page struct is zero, then I think I would suggest
we change the page flag to indicate the page has been initialized.

The heart of the problem as I see it comes from expand(). We nearly
always see a first reference to a struct page which is in the middle
of the 2MiB region. Due to that access, the unlikely() check that was
originally proposed really ends up referencing a different page entirely.
We actually did not introduce an unlikely and refactor the patches to
make that unlikely inside a static inline function. Also, given the
strong warning at the head of expand(), we did not feel experienced
enough to refactor it to make things always reference the 2MiB page
first.

With this patch, we did boot a 16TiB machine. Without the patches, the
v3.10 kernel with the same configuration took 407 seconds for
free_all_bootmem.
With the patches and operating on 2MiB pages instead
of 1GiB, it took 26 seconds so performance was improved. I have no feel
for how the 1GiB chunk size will perform.

I am on vacation for the next three days so I am sorry in advance for
my infrequent or non-existent responses.

Signed-off-by: Robin Holt
Signed-off-by: Nate Zimmer
To: "H. Peter Anvin"
To: Ingo Molnar
Cc: Linux Kernel
Cc: Linux MM
Cc: Rob Landley
Cc: Mike Travis
Cc: Daniel J Blueman
Cc: Andrew Morton
Cc: Greg KH
Cc: Yinghai Lu
Cc: Mel Gorman

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157])
	by kanga.kvack.org (Postfix) with SMTP id 5810C6B0033
	for ; Thu, 11 Jul 2013 22:04:07 -0400 (EDT)
From: Robin Holt
Subject: [RFC 1/4] memblock: Introduce a for_each_reserved_mem_region iterator.
Date: Thu, 11 Jul 2013 21:03:52 -0500
Message-Id: <1373594635-131067-2-git-send-email-holt@sgi.com>
In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com>
References: <1373594635-131067-1-git-send-email-holt@sgi.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: "H. Peter Anvin", Ingo Molnar
Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
	Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
	Mel Gorman

As part of initializing struct page's in 2MiB chunks, we noticed that
at the end of free_all_bootmem(), there was nothing which had forced
the reserved/allocated 4KiB pages to be initialized.

This helper function will be used for that expansion.

Signed-off-by: Robin Holt
Signed-off-by: Nate Zimmer
To: "H.
Peter Anvin"
To: Ingo Molnar
Cc: Linux Kernel
Cc: Linux MM
Cc: Rob Landley
Cc: Mike Travis
Cc: Daniel J Blueman
Cc: Andrew Morton
Cc: Greg KH
Cc: Yinghai Lu
Cc: Mel Gorman
---
 include/linux/memblock.h | 18 ++++++++++++++++++
 mm/memblock.c            | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..e99bbd1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
 	     i != (u64)ULLONG_MAX;					\
 	     __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
 
+void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
+			       phys_addr_t *out_end);
+
+/**
+ * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * @i: u64 used as loop variable
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over reserved areas of memblock. Available as soon as memblock
+ * is initialized.
+ */
+#define for_each_reserved_mem_region(i, p_start, p_end)			\
+	for (i = 0UL,							\
+	     __next_reserved_mem_region(&i, p_start, p_end);		\
+	     i != (u64)ULLONG_MAX;					\
+	     __next_reserved_mem_region(&i, p_start, p_end))
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
 
diff --git a/mm/memblock.c b/mm/memblock.c
index c5fad93..0d7d6e7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 }
 
 /**
+ * __next_reserved_mem_region - next function for for_each_reserved_mem_region()
+ * @idx: pointer to u64 loop variable
+ * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL
+ *
+ * Iterate over all reserved memory regions.
+ */
+void __init_memblock __next_reserved_mem_region(u64 *idx,
+					   phys_addr_t *out_start,
+					   phys_addr_t *out_end)
+{
+	struct memblock_type *rsv = &memblock.reserved;
+
+	if (*idx >= 0 && *idx < rsv->cnt) {
+		struct memblock_region *r = &rsv->regions[*idx];
+		phys_addr_t base = r->base;
+		phys_addr_t size = r->size;
+
+		if (out_start)
+			*out_start = base;
+		if (out_end)
+			*out_end = base + size - 1;
+
+		*idx += 1;
+		return;
+	}
+
+	/* signal end of iteration */
+	*idx = ULLONG_MAX;
+}
+
+/**
  * __next_free_mem_range - next function for for_each_free_mem_range()
  * @idx: pointer to u64 loop variable
  * @nid: nid: node selector, %MAX_NUMNODES for all nodes
--
1.8.2.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx152.postini.com [74.125.245.152])
	by kanga.kvack.org (Postfix) with SMTP id 8E0B26B0034
	for ; Thu, 11 Jul 2013 22:04:10 -0400 (EDT)
From: Robin Holt
Subject: [RFC 2/4] Have __free_pages_memory() free in larger chunks.
Date: Thu, 11 Jul 2013 21:03:53 -0500
Message-Id: <1373594635-131067-3-git-send-email-holt@sgi.com>
In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com>
References: <1373594635-131067-1-git-send-email-holt@sgi.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: "H. Peter Anvin", Ingo Molnar
Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
	Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
	Mel Gorman

Currently, when free_all_bootmem() calls __free_pages_memory(), the
number of contiguous pages that __free_pages_memory() passes to the
buddy allocator is limited to BITS_PER_LONG. In order to be able to
free only the first page of a 2MiB chunk, we need that to be increased
to PTRS_PER_PMD.

Signed-off-by: Robin Holt
Signed-off-by: Nate Zimmer
To: "H.
Peter Anvin"
To: Ingo Molnar
Cc: Linux Kernel
Cc: Linux MM
Cc: Rob Landley
Cc: Mike Travis
Cc: Daniel J Blueman
Cc: Andrew Morton
Cc: Greg KH
Cc: Yinghai Lu
Cc: Mel Gorman
---
 mm/nobootmem.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bdd3fa2..3b512ca 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
 static void __init __free_pages_memory(unsigned long start, unsigned long end)
 {
 	unsigned long i, start_aligned, end_aligned;
-	int order = ilog2(BITS_PER_LONG);
+	int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD));
 
-	start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1);
-	end_aligned = end & ~(BITS_PER_LONG - 1);
+	start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1);
+	end_aligned = end & ~((1UL << order) - 1);
 
 	if (end_aligned <= start_aligned) {
 		for (i = start; i < end; i++)
@@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
 	for (i = start; i < start_aligned; i++)
 		__free_pages_bootmem(pfn_to_page(i), 0);
 
-	for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG)
+	for (i = start_aligned; i < end_aligned; i += 1 << order)
 		__free_pages_bootmem(pfn_to_page(i), order);
 
 	for (i = end_aligned; i < end; i++)
--
1.8.2.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx149.postini.com [74.125.245.149])
	by kanga.kvack.org (Postfix) with SMTP id 597F46B0036
	for ; Thu, 11 Jul 2013 22:04:13 -0400 (EDT)
From: Robin Holt
Subject: [RFC 3/4] Seperate page initialization into a separate function.
Date: Thu, 11 Jul 2013 21:03:54 -0500
Message-Id: <1373594635-131067-4-git-send-email-holt@sgi.com>
In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com>
References: <1373594635-131067-1-git-send-email-holt@sgi.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: "H. Peter Anvin", Ingo Molnar
Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
	Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
	Mel Gorman

Currently, memmap_init_zone() has all the smarts for initializing a
single page. When we convert to initializing pages in a 2MiB chunk,
we will need to do this equivalent work from two separate places
so we are breaking out a helper function.

Signed-off-by: Robin Holt
Signed-off-by: Nate Zimmer
To: "H. Peter Anvin"
To: Ingo Molnar
Cc: Linux Kernel
Cc: Linux MM
Cc: Rob Landley
Cc: Mike Travis
Cc: Daniel J Blueman
Cc: Andrew Morton
Cc: Greg KH
Cc: Yinghai Lu
Cc: Mel Gorman
---
 mm/mm_init.c    |  2 +-
 mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------
 2 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index c280a02..be8a539 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void)
 	BUG_ON(or_mask != add_mask);
 }
 
-void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
+void mminit_verify_page_links(struct page *page, enum zone_type zone,
 			unsigned long nid, unsigned long pfn)
 {
 	BUG_ON(page_to_nid(page) != nid);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3edb62..635b131 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	spin_unlock(&zone->lock);
 }
 
+static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct zone *z = &NODE_DATA(nid)->node_zones[zone];
+
+	set_page_links(page, zone, nid, pfn);
+	mminit_verify_page_links(page, zone, nid, pfn);
+	init_page_count(page);
+	page_mapcount_reset(page);
+	page_nid_reset_last(page);
+	if (reserved) {
+		SetPageReserved(page);
+	} else {
+		ClearPageReserved(page);
+		set_page_count(page, 0);
+	}
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations
+	 * to reserve their blocks rather than leaking throughout
+	 * the address space during boot when many long-lived
+	 * kernel allocations are made. Later some blocks near
+	 * the start are marked MIGRATE_RESERVE by
+	 * setup_zone_migrate_reserve()
+	 *
+	 * bitmap is created for zone's valid pfn range. but memmap
+	 * can be created for invalid pages (for alignment)
+	 * check here not to call set_pageblock_migratetype() against
+	 * pfn out of zone.
+	 */
+	if ((z->zone_start_pfn <= pfn)
+	    && (pfn < zone_end_pfn(z))
+	    && !(pfn & (pageblock_nr_pages - 1)))
+		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+
+	INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+	/* The shift won't overflow because ZONE_NORMAL is below 4G. */
+	if (!is_highmem_idx(zone))
+		set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+}
+
 static bool free_pages_prepare(struct page *page, unsigned int order)
 {
 	int i;
@@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 				continue;
 		}
 		page = pfn_to_page(pfn);
-		set_page_links(page, zone, nid, pfn);
-		mminit_verify_page_links(page, zone, nid, pfn);
-		init_page_count(page);
-		page_mapcount_reset(page);
-		page_nid_reset_last(page);
-		SetPageReserved(page);
-		/*
-		 * Mark the block movable so that blocks are reserved for
-		 * movable at startup. This will force kernel allocations
-		 * to reserve their blocks rather than leaking throughout
-		 * the address space during boot when many long-lived
-		 * kernel allocations are made.
Later some blocks near
-		 * the start are marked MIGRATE_RESERVE by
-		 * setup_zone_migrate_reserve()
-		 *
-		 * bitmap is created for zone's valid pfn range. but memmap
-		 * can be created for invalid pages (for alignment)
-		 * check here not to call set_pageblock_migratetype() against
-		 * pfn out of zone.
-		 */
-		if ((z->zone_start_pfn <= pfn)
-		    && (pfn < zone_end_pfn(z))
-		    && !(pfn & (pageblock_nr_pages - 1)))
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
-		INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
-		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
-		if (!is_highmem_idx(zone))
-			set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
+		__init_single_page(page, zone, nid, 1);
 	}
 }
 
--
1.8.2.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157])
	by kanga.kvack.org (Postfix) with SMTP id D64156B0037
	for ; Thu, 11 Jul 2013 22:04:15 -0400 (EDT)
From: Robin Holt
Subject: [RFC 4/4] Sparse initialization of struct page array.
Date: Thu, 11 Jul 2013 21:03:55 -0500
Message-Id: <1373594635-131067-5-git-send-email-holt@sgi.com>
In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com>
References: <1373594635-131067-1-git-send-email-holt@sgi.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: "H. Peter Anvin", Ingo Molnar
Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
	Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
	Mel Gorman

During boot of large memory machines, a significant portion of boot
is spent initializing the struct page array. The vast majority of
those pages are not referenced during boot.

Change this over to only initializing the pages when they are
actually allocated.
Besides the advantage of boot speed, this allows us the chance to
use normal performance monitoring tools to determine where the bulk
of time is spent during page initialization.

Signed-off-by: Robin Holt
Signed-off-by: Nate Zimmer
To: "H. Peter Anvin"
To: Ingo Molnar
Cc: Linux Kernel
Cc: Linux MM
Cc: Rob Landley
Cc: Mike Travis
Cc: Daniel J Blueman
Cc: Andrew Morton
Cc: Greg KH
Cc: Yinghai Lu
Cc: Mel Gorman
---
 include/linux/mm.h         |  11 +++++
 include/linux/page-flags.h |   5 +-
 mm/nobootmem.c             |   5 ++
 mm/page_alloc.c            | 117 +++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 132 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..3de08b5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page)
 	__free_page(page);
 }
 
+extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
+
+static inline void __reserve_bootmem_page(struct page *page)
+{
+	phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT;
+	phys_addr_t end = start + PAGE_SIZE;
+
+	__reserve_bootmem_region(start, end);
+}
+
 static inline void free_reserved_page(struct page *page)
 {
+	__reserve_bootmem_page(page);
 	__free_reserved_page(page);
 	adjust_managed_page_count(page, 1);
 }
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..79e8eb7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -83,6 +83,7 @@ enum pageflags {
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,
 	PG_reserved,
+	PG_uninitialized2mib,	/* Is this the right spot?
ntz - Yes - rmh */
 	PG_private,		/* If pagecache, has fs-private data */
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
@@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
+PAGEFLAG(Uninitialized2Mib, uninitialized2mib)
+
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
@@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 #define PAGE_FLAGS_CHECK_AT_FREE \
 	(1 << PG_lru | 1 << PG_locked | \
 	 1 << PG_private | 1 << PG_private_2 | \
-	 1 << PG_writeback | 1 << PG_reserved | \
+	 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \
 	 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
 	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
 	 __PG_COMPOUND_LOCK)
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 3b512ca..e3a386d 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void)
 	phys_addr_t start, end, size;
 	u64 i;
 
+	for_each_reserved_mem_region(i, &start, &end)
+		__reserve_bootmem_region(start, end);
+
 	for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
 		count += __free_memory_core(start, end);
 
@@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void)
 {
 	struct pglist_data *pgdat;
 
+	memblock_dump_all();
+
 	for_each_online_pgdat(pgdat)
 		reset_node_lowmem_managed_pages(pgdat);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 635b131..fe51eb9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i
 #endif
 }
 
+static void expand_page_initialization(struct page *basepage)
+{
+	unsigned long pfn = page_to_pfn(basepage);
+	unsigned long end_pfn = pfn + PTRS_PER_PMD;
+	unsigned long zone = page_zonenum(basepage);
+	int reserved = PageReserved(basepage);
+	int nid =
page_to_nid(basepage);
+
+	ClearPageUninitialized2Mib(basepage);
+
+	for( pfn++; pfn < end_pfn; pfn++ )
+		__init_single_page(pfn_to_page(pfn), zone, nid, reserved);
+}
+
+void ensure_pages_are_initialized(unsigned long start_pfn,
+				  unsigned long end_pfn)
+{
+	unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1);
+	unsigned long aligned_end_pfn;
+	struct page *page;
+
+	aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1);
+	aligned_end_pfn += PTRS_PER_PMD;
+	while (aligned_start_pfn < aligned_end_pfn) {
+		if (pfn_valid(aligned_start_pfn)) {
+			page = pfn_to_page(aligned_start_pfn);
+
+			if(PageUninitialized2Mib(page))
+				expand_page_initialization(page);
+		}
+
+		aligned_start_pfn += PTRS_PER_PMD;
+	}
+}
+
+void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
+{
+	unsigned long start_pfn = PFN_DOWN(start);
+	unsigned long end_pfn = PFN_UP(end);
+
+	ensure_pages_are_initialized(start_pfn, end_pfn);
+}
+
+static inline void ensure_page_is_initialized(struct page *page)
+{
+	__reserve_bootmem_page(page);
+}
+
 static bool free_pages_prepare(struct page *page, unsigned int order)
 {
 	int i;
@@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 	if (PageAnon(page))
 		page->mapping = NULL;
 	for (i = 0; i < (1 << order); i++)
-		bad += free_pages_check(page + i);
+		if (PageUninitialized2Mib(page + i))
+			i += PTRS_PER_PMD - 1;
+		else
+			bad += free_pages_check(page + i);
 	if (bad)
 		return false;
 
@@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
 	unsigned int loop;
 
 	prefetchw(page);
-	for (loop = 0; loop < nr_pages; loop++) {
+	for (loop = 0; loop < nr_pages; ) {
 		struct page *p = &page[loop];
 
 		if (loop + 1 < nr_pages)
 			prefetchw(p + 1);
+
+		if ((PageUninitialized2Mib(p)) &&
+		    ((loop + PTRS_PER_PMD) > nr_pages))
+			ensure_page_is_initialized(p);
+
 		__ClearPageReserved(p);
 		set_page_count(p, 0);
+		if (PageUninitialized2Mib(p))
+			loop += PTRS_PER_PMD;
+		else
+			loop += 1;
 	}
 
 	page_zone(page)->managed_pages += 1 << order;
@@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page,
 		area--;
 		high--;
 		size >>= 1;
+		ensure_page_is_initialized(page);
 		VM_BUG_ON(bad_range(zone, &page[size]));
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
@@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 
 	for (i = 0; i < (1 << order); i++) {
 		struct page *p = page + i;
+
+		if (PageUninitialized2Mib(p))
+			expand_page_initialization(page);
+
 		if (unlikely(check_new_page(p)))
 			return 1;
 	}
@@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone,
 	unsigned long order;
 	int pages_moved = 0;
 
+	ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page));
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
 	 * page_zone is not safe to call in this context when
@@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn)
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
 			return 1;
+
+		if (PageUninitialized2Mib(pfn_to_page(pfn)))
+			pfn += PTRS_PER_PMD;
 	}
 	return 0;
 }
@@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 	}
 }
 
+int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn,
+				   unsigned long size, int nid)
+{
+	unsigned long validate_end_pfn = pfn + size;
+
+	if (pfn & (size - 1))
+		return 1;
+
+	if (pfn + size >= end_pfn)
+		return 1;
+
+	while (pfn < validate_end_pfn)
+	{
+		if (!early_pfn_valid(pfn))
+			return 1;
+		if (!early_pfn_in_nid(pfn, nid))
+			return 1;
+		pfn++;
+	}
+
+	return size;
+}
+
 /*
  * Initially all pages are reserved - free ones are freed
  * up by free_all_bootmem() once the early boot process is
@@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		highest_memmap_pfn = end_pfn - 1;
 
 	z = &NODE_DATA(nid)->node_zones[zone];
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; ) {
 		/*
 		 * There
can be holes in boot-time mem_map[]s
 		 * handed to this function. They do not
 		 * exist on hotplugged memory.
 		 */
+		int pfns = 1;
 		if (context == MEMMAP_EARLY) {
-			if (!early_pfn_valid(pfn))
+			if (!early_pfn_valid(pfn)) {
+				pfn++;
 				continue;
-			if (!early_pfn_in_nid(pfn, nid))
+			}
+			if (!early_pfn_in_nid(pfn, nid)) {
+				pfn++;
 				continue;
+			}
+
+			pfns = pfn_range_init_avail(pfn, end_pfn,
+						    PTRS_PER_PMD, nid);
 		}
+
 		page = pfn_to_page(pfn);
 		__init_single_page(page, zone, nid, 1);
+
+		if (pfns > 1)
+			SetPageUninitialized2Mib(page);
+
+		pfn += pfns;
 	}
 }
 
@@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_owner_priv_1,	"owner_priv_1"	},
 	{1UL << PG_arch_1,		"arch_1"	},
 	{1UL << PG_reserved,		"reserved"	},
+	{1UL << PG_uninitialized2mib,	"Uninit_2MiB"	},
 	{1UL << PG_private,		"private"	},
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
--
1.8.2.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx185.postini.com [74.125.245.185])
	by kanga.kvack.org (Postfix) with SMTP id 3D3E46B0032
	for ; Fri, 12 Jul 2013 03:46:01 -0400 (EDT)
Date: Fri, 12 Jul 2013 02:45:58 -0500
From: Robin Holt
Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks.
Message-ID: <20130712074558.GP18798@sgi.com>
References: <1373594635-131067-1-git-send-email-holt@sgi.com>
	<1373594635-131067-3-git-send-email-holt@sgi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1373594635-131067-3-git-send-email-holt@sgi.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: "H.
Peter Anvin", Ingo Molnar
Cc: Robin Holt, Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
	Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
	Mel Gorman

After sleeping on this, why can't we change __free_pages_bootmem
to not take an order, but rather nr_pages? If we did that,
then __free_pages_memory could just calculate nr_pages and call
__free_pages_bootmem one time.

I don't see why any of the callers of __free_pages_bootmem would not
easily support that change. Maybe I will work that up as part of a -v2
and see if it boots/runs.

At the very least, I think we could change to:

static void __init __free_pages_memory(unsigned long start, unsigned long end)
{
	int order;

	while (start < end) {
		order = ffs(start);

		while (start + (1UL << order) > end)
			order--;

		__free_pages_bootmem(start, order);

		start += (1UL << order);
	}
}

Robin

On Thu, Jul 11, 2013 at 09:03:53PM -0500, Robin Holt wrote:
> Currently, when free_all_bootmem() calls __free_pages_memory(), the
> number of contiguous pages that __free_pages_memory() passes to the
> buddy allocator is limited to BITS_PER_LONG. In order to be able to
> free only the first page of a 2MiB chunk, we need that to be increased
> to PTRS_PER_PMD.
>
> Signed-off-by: Robin Holt
> Signed-off-by: Nate Zimmer
> To: "H.
Peter Anvin"
> To: Ingo Molnar
> Cc: Linux Kernel
> Cc: Linux MM
> Cc: Rob Landley
> Cc: Mike Travis
> Cc: Daniel J Blueman
> Cc: Andrew Morton
> Cc: Greg KH
> Cc: Yinghai Lu
> Cc: Mel Gorman
> ---
>  mm/nobootmem.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index bdd3fa2..3b512ca 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
>  static void __init __free_pages_memory(unsigned long start, unsigned long end)
>  {
>  	unsigned long i, start_aligned, end_aligned;
> -	int order = ilog2(BITS_PER_LONG);
> +	int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD));
>
> -	start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1);
> -	end_aligned = end & ~(BITS_PER_LONG - 1);
> +	start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1);
> +	end_aligned = end & ~((1UL << order) - 1);
>
>  	if (end_aligned <= start_aligned) {
>  		for (i = start; i < end; i++)
> @@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
>  	for (i = start; i < start_aligned; i++)
>  		__free_pages_bootmem(pfn_to_page(i), 0);
>
> -	for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG)
> +	for (i = start_aligned; i < end_aligned; i += 1 << order)
>  		__free_pages_bootmem(pfn_to_page(i), order);
>
>  	for (i = end_aligned; i < end; i++)
> --
> 1.8.2.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx150.postini.com [74.125.245.150])
	by kanga.kvack.org (Postfix) with SMTP id 1A8E86B0032
	for ; Fri, 12 Jul 2013 04:28:02 -0400 (EDT)
Received: by mail-ee0-f45.google.com with SMTP id c1so6074671eek.4
	for ; Fri, 12 Jul 2013 01:28:00 -0700 (PDT)
Date: Fri, 12 Jul 2013 10:27:56 +0200
From: Ingo Molnar
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator
Message-ID: <20130712082756.GA4328@gmail.com>
References: <1373594635-131067-1-git-send-email-holt@sgi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Robin Holt, Borislav Petkov, Robert Richter
Cc: "H. Peter Anvin", Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
	Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
	Mel Gorman, Peter Zijlstra

* Robin Holt wrote:

> [...]
>
> With this patch, we did boot a 16TiB machine. Without the patches, the
> v3.10 kernel with the same configuration took 407 seconds for
> free_all_bootmem. With the patches and operating on 2MiB pages instead
> of 1GiB, it took 26 seconds so performance was improved. I have no feel
> for how the 1GiB chunk size will perform.

That's pretty impressive.

It's still a 15x speedup instead of a 512x speedup, so I'd say there's
something else being the current bottleneck, besides page init
granularity.

Can you boot with just a few gigs of RAM and stuff the rest into hotplug
memory, and then hot-add that memory? That would allow easy profiling of
remaining overhead.

Side note:

Robert Richter and Boris Petkov are working on 'persistent events'
support for perf, which will eventually allow boot time profiling -
I'm not sure if the patches and the tooling support is ready enough
yet for your purposes.
Robert, Boris, the following workflow would be pretty intuitive:

 - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB

 - we'd get a single (cycles?) event running once the perf subsystem is up
   and running, with a sampling frequency of 1 KHz, sending profiling
   trace events to a sufficiently sized profiling buffer of 16 MB per
   CPU.

 - once the system reaches SYSTEM_RUNNING, profiling is stopped either
   automatically - or the user stops it via a new tooling command.

 - the profiling buffer is extracted into a regular perf.data via a
   special 'perf record' call or some other, new perf tooling
   solution/variant.

   [ Alternatively the kernel could attempt to construct a 'virtual'
     perf.data from the persistent buffer, available via /sys/debug or
     elsewhere in /sys - just like the kernel constructs a 'virtual'
     /proc/kcore, etc. That file could be copied or used directly. ]

 - from that point on this workflow joins the regular profiling workflow:
   perf report, perf script et al can be used to analyze the resulting
   boot profile.

Thanks,

	Ingo

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158])
	by kanga.kvack.org (Postfix) with SMTP id 05C5A6B0034
	for ; Fri, 12 Jul 2013 04:47:32 -0400 (EDT)
Date: Fri, 12 Jul 2013 10:47:12 +0200
From: Borislav Petkov
Subject: boot tracing
Message-ID: <20130712084712.GD24008@pd.tnic>
References: <1373594635-131067-1-git-send-email-holt@sgi.com>
	<20130712082756.GA4328@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20130712082756.GA4328@gmail.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ingo Molnar
Cc: Robin Holt, Robert Richter, "H.
Peter Anvin", Nate Zimmer, Linux Kernel, Linux MM, Rob Landley,
	Mike Travis, Daniel J Blueman, Andrew Morton, Greg KH, Yinghai Lu,
	Mel Gorman, Peter Zijlstra

On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote:
> Robert Richter and Boris Petkov are working on 'persistent events'
> support for perf, which will eventually allow boot time profiling -
> I'm not sure if the patches and the tooling support is ready enough
> yet for your purposes.

Nope, not yet but we're getting there.

> Robert, Boris, the following workflow would be pretty intuitive:
>
>  - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB

What does perf=boot mean? I assume boot tracing.

If so, does it mean we want to enable *all* tracepoints and collect
whatever hits us?

What makes more sense to me is to hijack what the function tracer does -
i.e. simply collect all function calls.

>  - we'd get a single (cycles?) event running once the perf subsystem is up
>    and running, with a sampling frequency of 1 KHz, sending profiling
>    trace events to a sufficiently sized profiling buffer of 16 MB per
>    CPU.

Right, what would those trace events be?

>  - once the system reaches SYSTEM_RUNNING, profiling is stopped either
>    automatically - or the user stops it via a new tooling command.

Ok.

>  - the profiling buffer is extracted into a regular perf.data via a
>    special 'perf record' call or some other, new perf tooling
>    solution/variant.
>
>    [ Alternatively the kernel could attempt to construct a 'virtual'
>      perf.data from the persistent buffer, available via /sys/debug or
>      elsewhere in /sys - just like the kernel constructs a 'virtual'
>      /proc/kcore, etc. That file could be copied or used directly. ]

Yeah, that.

>  - from that point on this workflow joins the regular profiling workflow:
>    perf report, perf script et al can be used to analyze the resulting
>    boot profile.

Agreed.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx137.postini.com [74.125.245.137]) by kanga.kvack.org (Postfix) with SMTP id BC0936B0034 for ; Fri, 12 Jul 2013 04:53:45 -0400 (EDT) Received: by mail-ee0-f54.google.com with SMTP id t10so6083663eei.41 for ; Fri, 12 Jul 2013 01:53:44 -0700 (PDT) Date: Fri, 12 Jul 2013 10:53:41 +0200 From: Ingo Molnar Subject: Re: boot tracing Message-ID: <20130712085341.GC4328@gmail.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> <20130712084712.GD24008@pd.tnic> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130712084712.GD24008@pd.tnic> Sender: owner-linux-mm@kvack.org List-ID: To: Borislav Petkov Cc: Robin Holt , Robert Richter , "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra * Borislav Petkov wrote: > On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > > Robert Richter and Boris Petkov are working on 'persistent events' > > support for perf, which will eventually allow boot time profiling - > > I'm not sure if the patches and the tooling support is ready enough > > yet for your purposes. > > Nope, not yet but we're getting there. > > > Robert, Boris, the following workflow would be pretty intuitive: > > > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > > What does perf=boot mean? I assume boot tracing. In this case it would mean boot profiling - i.e. a cycles hardware-PMU event collecting into a perf trace buffer as usual. 
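To make the proposed flag concrete, here is a minimal userspace sketch of how the text after "perf=" could be split into its parts. Nothing like this exists in the thread or in the kernel: the struct, the function names, the accepted unit suffixes, and the reading of "MB" as 2^20 bytes are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the proposed option; not an existing kernel structure. */
struct boot_perf_opts {
	int boot;                 /* 1 if the "boot" keyword was given */
	unsigned long freq_hz;    /* sampling frequency in Hz */
	unsigned long size_bytes; /* per-CPU buffer size in bytes */
};

/* Case-insensitive string equality, so that "MB" and "mb" both match. */
static int unit_eq(const char *a, const char *b)
{
	while (*a && *b) {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
		a++;
		b++;
	}
	return *a == '\0' && *b == '\0';
}

/* Parse "<number><unit>"; a bare number is taken as already scaled. */
static int parse_scaled(const char *val, const char *const units[],
			const unsigned long scales[], size_t n,
			unsigned long *out)
{
	char *end;
	unsigned long num = strtoul(val, &end, 10);
	size_t i;

	if (end == val)
		return -1;	/* no digits at all */
	if (*end == '\0') {
		*out = num;
		return 0;
	}
	for (i = 0; i < n; i++) {
		if (unit_eq(end, units[i])) {
			*out = num * scales[i];
			return 0;
		}
	}
	return -1;		/* unknown unit suffix */
}

/* Parse the text after "perf=", e.g. "boot,freq=1khz,size=16MB". Returns 0 on success. */
int parse_boot_perf(const char *arg, struct boot_perf_opts *o)
{
	/* Assumed unit tables; the proposal does not define them. */
	static const char *const funits[] = { "hz", "khz", "mhz" };
	static const unsigned long fscales[] = { 1UL, 1000UL, 1000000UL };
	static const char *const sunits[] = { "kb", "mb", "gb" };
	static const unsigned long sscales[] = { 1UL << 10, 1UL << 20, 1UL << 30 };
	const char *p = arg;

	memset(o, 0, sizeof(*o));
	while (*p) {
		char tok[32];
		size_t len = strcspn(p, ",");	/* length of the next comma-separated token */

		if (len == 0 || len >= sizeof(tok))
			return -1;
		memcpy(tok, p, len);
		tok[len] = '\0';

		if (strcmp(tok, "boot") == 0) {
			o->boot = 1;
		} else if (strncmp(tok, "freq=", 5) == 0) {
			if (parse_scaled(tok + 5, funits, fscales, 3, &o->freq_hz))
				return -1;
		} else if (strncmp(tok, "size=", 5) == 0) {
			if (parse_scaled(tok + 5, sunits, sscales, 3, &o->size_bytes))
				return -1;
		} else {
			return -1;	/* unknown keyword */
		}

		p += len;
		if (*p == ',')
			p++;
	}
	return 0;
}
```

Under these assumptions, "boot,freq=1khz,size=16MB" yields a 1000 Hz sampling frequency and a 16777216-byte per-CPU buffer. A real implementation would go through the kernel's own command-line parameter machinery rather than hand-rolled parsing.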
Essentially a 'perf record -a' work-alike, just one that gets activated as early as practical, and which would allow the profiling of memory initialization. Now, one extra complication here is that to be able to profile buddy allocator this persistent event would have to work before the buddy allocator is active :-/ So this sort of profiling would have to use memblock_alloc(). Just wanted to highlight this usecase, we might eventually want to support it. [ Note that this is different from boot tracing of one or more trace events - but it's a conceptually pretty close cousin. ] Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx191.postini.com [74.125.245.191]) by kanga.kvack.org (Postfix) with SMTP id B56AA6B0032 for ; Fri, 12 Jul 2013 05:19:18 -0400 (EDT) Received: by mail-wg0-f47.google.com with SMTP id l18so8006693wgh.14 for ; Fri, 12 Jul 2013 02:19:17 -0700 (PDT) Date: Fri, 12 Jul 2013 10:19:09 +0100 From: Robert Richter Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130712091909.GC8731@rric.localhost> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130712082756.GA4328@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Robin Holt , Borislav Petkov , "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra On 12.07.13 10:27:56, Ingo Molnar wrote: > > * Robin Holt wrote: > > > [...] > > > > With this patch, we did boot a 16TiB machine. 
Without the patches, the > > v3.10 kernel with the same configuration took 407 seconds for > > free_all_bootmem. With the patches and operating on 2MiB pages instead > > of 1GiB, it took 26 seconds so performance was improved. I have no feel > > for how the 1GiB chunk size will perform. > > That's pretty impressive. > > It's still a 15x speedup instead of a 512x speedup, so I'd say there's > something else being the current bottleneck, besides page init > granularity. > > Can you boot with just a few gigs of RAM and stuff the rest into hotplug > memory, and then hot-add that memory? That would allow easy profiling of > remaining overhead. > > Side note: > > Robert Richter and Boris Petkov are working on 'persistent events' support > for perf, which will eventually allow boot time profiling - I'm not sure > if the patches and the tooling support is ready enough yet for your > purposes. The latest patch set is still this: git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v2 It requires the perf subsystem to be initialized first which might be too late, see perf_event_init() in start_kernel(). The patch set is currently also limited to tracepoints only. If this is sufficient for you, you might register persistent events with the function perf_add_persistent_event_by_id(), see mcheck_init_tp() how to do this. Later you can fetch all samples with: # perf record -e persistent// sleep 1 > Robert, Boris, the following workflow would be pretty intuitive: > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > > - we'd get a single (cycles?) event running once the perf subsystem is up > and running, with a sampling frequency of 1 KHz, sending profiling > trace events to a sufficiently sized profiling buffer of 16 MB per > CPU. I am not sure about the event you want to setup here, if it is a tracepoint the sample_period should be always 1. 
The buffer size parameter looks interesting; for now it is 512kB per
CPU by default (as the perf tools set up the buffer).

>
>  - once the system reaches SYSTEM_RUNNING, profiling is stopped either
>    automatically - or the user stops it via a new tooling command.
>
>  - the profiling buffer is extracted into a regular perf.data via a
>    special 'perf record' call or some other, new perf tooling
>    solution/variant.

See the perf-record command above...

>
>    [ Alternatively the kernel could attempt to construct a 'virtual'
>      perf.data from the persistent buffer, available via /sys/debug or
>      elsewhere in /sys - just like the kernel constructs a 'virtual'
>      /proc/kcore, etc. That file could be copied or used directly. ]
>
>  - from that point on this workflow joins the regular profiling workflow:
>    perf report, perf script et al can be used to analyze the resulting
>    boot profile.

Ingo, thanks for outlining this workflow. We will look at how this could
fit into the new version of persistent events we are currently working on.

Thanks,

-Robert

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx111.postini.com [74.125.245.111]) by kanga.kvack.org (Postfix) with SMTP id 0E3B56B0031 for ; Fri, 12 Jul 2013 23:06:53 -0400 (EDT)
Received: by mail-ie0-f176.google.com with SMTP id ar20so22430957iec.35 for ; Fri, 12 Jul 2013 20:06:53 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <1373594635-131067-4-git-send-email-holt@sgi.com>
References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-4-git-send-email-holt@sgi.com>
Date: Fri, 12 Jul 2013 20:06:52 -0700
Message-ID:
Subject: Re: [RFC 3/4] Seperate page initialization into a separate function.
From: Yinghai Lu Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Robin Holt Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote: > Currently, memmap_init_zone() has all the smarts for initializing a > single page. When we convert to initializing pages in a 2MiB chunk, > we will need to do this equivalent work from two separate places > so we are breaking out a helper function. > > Signed-off-by: Robin Holt > Signed-off-by: Nate Zimmer > To: "H. Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > --- > mm/mm_init.c | 2 +- > mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------ > 2 files changed, 45 insertions(+), 32 deletions(-) > > diff --git a/mm/mm_init.c b/mm/mm_init.c > index c280a02..be8a539 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) > BUG_ON(or_mask != add_mask); > } > > -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, > +void mminit_verify_page_links(struct page *page, enum zone_type zone, > unsigned long nid, unsigned long pfn) > { > BUG_ON(page_to_nid(page) != nid); > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c3edb62..635b131 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > spin_unlock(&zone->lock); > } > > +static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved) > +{ > + unsigned long pfn = page_to_pfn(page); > + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > + > + set_page_links(page, zone, nid, pfn); > + 
mminit_verify_page_links(page, zone, nid, pfn); > + init_page_count(page); > + page_mapcount_reset(page); > + page_nid_reset_last(page); > + if (reserved) { > + SetPageReserved(page); > + } else { > + ClearPageReserved(page); > + set_page_count(page, 0); > + } > + /* > + * Mark the block movable so that blocks are reserved for > + * movable at startup. This will force kernel allocations > + * to reserve their blocks rather than leaking throughout > + * the address space during boot when many long-lived > + * kernel allocations are made. Later some blocks near > + * the start are marked MIGRATE_RESERVE by > + * setup_zone_migrate_reserve() > + * > + * bitmap is created for zone's valid pfn range. but memmap > + * can be created for invalid pages (for alignment) > + * check here not to call set_pageblock_migratetype() against > + * pfn out of zone. > + */ > + if ((z->zone_start_pfn <= pfn) > + && (pfn < zone_end_pfn(z)) > + && !(pfn & (pageblock_nr_pages - 1))) > + set_pageblock_migratetype(page, MIGRATE_MOVABLE); > + > + INIT_LIST_HEAD(&page->lru); > +#ifdef WANT_PAGE_VIRTUAL > + /* The shift won't overflow because ZONE_NORMAL is below 4G. */ > + if (!is_highmem_idx(zone)) > + set_page_address(page, __va(pfn << PAGE_SHIFT)); > +#endif > +} > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > @@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > continue; > } > page = pfn_to_page(pfn); > - set_page_links(page, zone, nid, pfn); > - mminit_verify_page_links(page, zone, nid, pfn); > - init_page_count(page); > - page_mapcount_reset(page); > - page_nid_reset_last(page); > - SetPageReserved(page); > - /* > - * Mark the block movable so that blocks are reserved for > - * movable at startup. This will force kernel allocations > - * to reserve their blocks rather than leaking throughout > - * the address space during boot when many long-lived > - * kernel allocations are made. 
Later some blocks near > - * the start are marked MIGRATE_RESERVE by > - * setup_zone_migrate_reserve() > - * > - * bitmap is created for zone's valid pfn range. but memmap > - * can be created for invalid pages (for alignment) > - * check here not to call set_pageblock_migratetype() against > - * pfn out of zone. > - */ > - if ((z->zone_start_pfn <= pfn) > - && (pfn < zone_end_pfn(z)) > - && !(pfn & (pageblock_nr_pages - 1))) > - set_pageblock_migratetype(page, MIGRATE_MOVABLE); > - > - INIT_LIST_HEAD(&page->lru); > -#ifdef WANT_PAGE_VIRTUAL > - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ > - if (!is_highmem_idx(zone)) > - set_page_address(page, __va(pfn << PAGE_SHIFT)); > -#endif > + __init_single_page(page, zone, nid, 1); Can you move page = pfn_to_page(pfn) into __init_single_page and pass pfn directly? Yinghai -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx185.postini.com [74.125.245.185]) by kanga.kvack.org (Postfix) with SMTP id 0BAB56B0031 for ; Fri, 12 Jul 2013 23:08:56 -0400 (EDT) Received: by mail-ie0-f169.google.com with SMTP id 10so21999640ied.14 for ; Fri, 12 Jul 2013 20:08:56 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20130712074558.GP18798@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-3-git-send-email-holt@sgi.com> <20130712074558.GP18798@sgi.com> Date: Fri, 12 Jul 2013 20:08:56 -0700 Message-ID: Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks. From: Yinghai Lu Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Robin Holt Cc: "H. 
Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman

On Fri, Jul 12, 2013 at 12:45 AM, Robin Holt wrote:
> At the very least, I think we could change to:
> static void __init __free_pages_memory(unsigned long start, unsigned long end)
> {
>         int order;
>
>         while (start < end) {
>                 order = ffs(start);
>
>                 while (start + (1UL << order) > end)
>                         order--;
>
>                 __free_pages_bootmem(start, order);
>
>                 start += (1UL << order);
>         }
> }

should work, but need to make sure order < MAX_ORDER.

Yinghai

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx173.postini.com [74.125.245.173]) by kanga.kvack.org (Postfix) with SMTP id 8FC1E6B0031 for ; Sat, 13 Jul 2013 00:19:13 -0400 (EDT)
Received: by mail-ie0-f176.google.com with SMTP id ar20so21258721iec.7 for ; Fri, 12 Jul 2013 21:19:13 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <1373594635-131067-5-git-send-email-holt@sgi.com>
References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com>
Date: Fri, 12 Jul 2013 21:19:12 -0700
Message-ID:
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.
From: Yinghai Lu
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
List-ID:
To: Robin Holt
Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman

On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote:
> During boot of large memory machines, a significant portion of boot
> is spent initializing the struct page array. The vast majority of
> those pages are not referenced during boot.
> > Change this over to only initializing the pages when they are > actually allocated. > > Besides the advantage of boot speed, this allows us the chance to > use normal performance monitoring tools to determine where the bulk > of time is spent during page initialization. > > Signed-off-by: Robin Holt > Signed-off-by: Nate Zimmer > To: "H. Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > --- > include/linux/mm.h | 11 +++++ > include/linux/page-flags.h | 5 +- > mm/nobootmem.c | 5 ++ > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- > 4 files changed, 132 insertions(+), 6 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e0c8528..3de08b5 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > __free_page(page); > } > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > + > +static inline void __reserve_bootmem_page(struct page *page) > +{ > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > + phys_addr_t end = start + PAGE_SIZE; > + > + __reserve_bootmem_region(start, end); > +} > + > static inline void free_reserved_page(struct page *page) > { > + __reserve_bootmem_page(page); > __free_reserved_page(page); > adjust_managed_page_count(page, 1); > } > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 6d53675..79e8eb7 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -83,6 +83,7 @@ enum pageflags { > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > PG_arch_1, > PG_reserved, > + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ > PG_private, /* If pagecache, has fs-private data */ > PG_private_2, /* If pagecache, has fs aux data */ > PG_writeback, /* Page is under writeback */ > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) > > __PAGEFLAG(SlobFree, slob_free) > > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) > + > /* > * Private page markings that may be used by the filesystem that owns the page > * for its own purposes. > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) > #define PAGE_FLAGS_CHECK_AT_FREE \ > (1 << PG_lru | 1 << PG_locked | \ > 1 << PG_private | 1 << PG_private_2 | \ > - 1 << PG_writeback | 1 << PG_reserved | \ > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ > __PG_COMPOUND_LOCK) > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index 3b512ca..e3a386d 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) > phys_addr_t start, end, size; > u64 i; > > + for_each_reserved_mem_region(i, &start, &end) > + __reserve_bootmem_region(start, end); > + How about holes that is not in memblock.reserved? before this patch: free_area_init_node/free_area_init_core/memmap_init_zone will mark all page in node range to Reserved in struct page, at first. but those holes is not mapped via kernel low mapping. so it should be ok not touch "struct page" for them. Now you only mark reserved for memblock.reserved at first, and later mark {memblock.memory} - { memblock.reserved} to be available. And that is ok. but should split that change to another patch and add some comment and change log for the change. in case there is some user like UEFI etc do weird thing. 
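The "{memblock.memory} - {memblock.reserved}" set described above can be illustrated with a small userspace sketch. It is not the kernel's implementation (in the kernel, for_each_free_mem_range() performs this walk); struct range and subtract_ranges() are hypothetical names, and both inputs are assumed to be sorted by start address with no overlaps inside either list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical half-open physical range [start, end); not struct memblock_region. */
struct range {
	unsigned long long start;
	unsigned long long end;
};

/*
 * Write (mem minus rsv) into out[], at most max_out entries.
 * Assumes both inputs are sorted by start and internally non-overlapping.
 * Returns the number of ranges written.
 */
size_t subtract_ranges(const struct range *mem, size_t nmem,
		       const struct range *rsv, size_t nrsv,
		       struct range *out, size_t max_out)
{
	size_t n = 0, i, j = 0, k;

	for (i = 0; i < nmem; i++) {
		unsigned long long cur = mem[i].start;

		/* Reserved ranges ending at or before this memory range can never matter again. */
		while (j < nrsv && rsv[j].end <= cur)
			j++;

		/* Walk the reserved ranges that begin inside this memory range. */
		for (k = j; k < nrsv && rsv[k].start < mem[i].end; k++) {
			if (rsv[k].start > cur && n < max_out) {
				/* Free gap in front of this reserved range. */
				out[n].start = cur;
				out[n].end = rsv[k].start;
				n++;
			}
			if (rsv[k].end > cur)
				cur = rsv[k].end;
		}

		/* Free tail after the last reserved range, if any. */
		if (cur < mem[i].end && n < max_out) {
			out[n].start = cur;
			out[n].end = mem[i].end;
			n++;
		}
	}
	return n;
}
```

Holes that appear in neither list never show up in the output, which matches the observation above that their struct pages do not need to be touched.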
> for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > count += __free_memory_core(start, end); > > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) > { > struct pglist_data *pgdat; > > + memblock_dump_all(); > + Not needed. > for_each_online_pgdat(pgdat) > reset_node_lowmem_managed_pages(pgdat); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 635b131..fe51eb9 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > #endif > } > > +static void expand_page_initialization(struct page *basepage) > +{ > + unsigned long pfn = page_to_pfn(basepage); > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > + unsigned long zone = page_zonenum(basepage); > + int reserved = PageReserved(basepage); > + int nid = page_to_nid(basepage); > + > + ClearPageUninitialized2Mib(basepage); > + > + for( pfn++; pfn < end_pfn; pfn++ ) > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > +} > + > +void ensure_pages_are_initialized(unsigned long start_pfn, > + unsigned long end_pfn) > +{ > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > + unsigned long aligned_end_pfn; > + struct page *page; > + > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > + aligned_end_pfn += PTRS_PER_PMD; > + while (aligned_start_pfn < aligned_end_pfn) { > + if (pfn_valid(aligned_start_pfn)) { > + page = pfn_to_page(aligned_start_pfn); > + > + if(PageUninitialized2Mib(page)) > + expand_page_initialization(page); > + } > + > + aligned_start_pfn += PTRS_PER_PMD; > + } > +} > + > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > +{ > + unsigned long start_pfn = PFN_DOWN(start); > + unsigned long end_pfn = PFN_UP(end); > + > + ensure_pages_are_initialized(start_pfn, end_pfn); > +} that name is confusing, actually it is setting to struct page to Reserved only. maybe __reserve_pages_bootmem() to be aligned to free_pages_bootmem ? 
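For readers following the pfn arithmetic in the quoted ensure_pages_are_initialized(), here is a small userspace sketch of the same rounding. It assumes 4KiB base pages and 2MiB chunks, so 512 pfns per chunk, which is what PTRS_PER_PMD evaluates to on x86_64; CHUNK_PFNS and the helper names are made up for this example.

```c
#include <assert.h>

/* Assumption: 4KiB pages and 2MiB chunks, i.e. 512 pfns per chunk (x86_64). */
#define CHUNK_PFNS 512UL

/* Round a pfn down to the first pfn of its 2MiB chunk. */
unsigned long chunk_start(unsigned long pfn)
{
	return pfn & ~(CHUNK_PFNS - 1);
}

/* Same bound as the quoted loop: round end_pfn down, then add one full chunk. */
unsigned long chunk_loop_end(unsigned long end_pfn)
{
	return (end_pfn & ~(CHUNK_PFNS - 1)) + CHUNK_PFNS;
}

/* Number of 2MiB chunks the quoted loop would step through. */
unsigned long chunks_visited(unsigned long start_pfn, unsigned long end_pfn)
{
	assert(start_pfn <= end_pfn);
	return (chunk_loop_end(end_pfn) - chunk_start(start_pfn)) / CHUNK_PFNS;
}
```

With start_pfn = 700 and end_pfn = 1500 the loop covers the chunks starting at pfn 512 and pfn 1024. With end_pfn = 1024 it still covers both, so the bound behaves as if end_pfn were inclusive. If a caller passes an exclusive end, as PFN_UP(end) in __reserve_bootmem_region() appears to, one chunk beyond the range may be initialized early; that looks harmless but may be worth confirming against the patch.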
> + > +static inline void ensure_page_is_initialized(struct page *page) > +{ > + __reserve_bootmem_page(page); > +} how about use __reserve_page_bootmem directly and add comment in callers site ? > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) > if (PageAnon(page)) > page->mapping = NULL; > for (i = 0; i < (1 << order); i++) > - bad += free_pages_check(page + i); > + if (PageUninitialized2Mib(page + i)) > + i += PTRS_PER_PMD - 1; > + else > + bad += free_pages_check(page + i); > if (bad) > return false; > > @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) > unsigned int loop; > > prefetchw(page); > - for (loop = 0; loop < nr_pages; loop++) { > + for (loop = 0; loop < nr_pages; ) { > struct page *p = &page[loop]; > > if (loop + 1 < nr_pages) > prefetchw(p + 1); > + > + if ((PageUninitialized2Mib(p)) && > + ((loop + PTRS_PER_PMD) > nr_pages)) > + ensure_page_is_initialized(p); > + > __ClearPageReserved(p); > set_page_count(p, 0); > + if (PageUninitialized2Mib(p)) > + loop += PTRS_PER_PMD; > + else > + loop += 1; > } > > page_zone(page)->managed_pages += 1 << order; > @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page, > area--; > high--; > size >>= 1; > + ensure_page_is_initialized(page); > VM_BUG_ON(bad_range(zone, &page[size])); > > #ifdef CONFIG_DEBUG_PAGEALLOC > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > for (i = 0; i < (1 << order); i++) { > struct page *p = page + i; > + > + if (PageUninitialized2Mib(p)) > + expand_page_initialization(page); > + > if (unlikely(check_new_page(p))) > return 1; > } > @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone, > unsigned long order; > int pages_moved = 0; > > + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page)); > #ifndef 
CONFIG_HOLES_IN_ZONE > /* > * page_zone is not safe to call in this context when > @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) > for (pfn = start_pfn; pfn < end_pfn; pfn++) { > if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) > return 1; > + > + if (PageUninitialized2Mib(pfn_to_page(pfn))) > + pfn += PTRS_PER_PMD; > } > return 0; > } > @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone) > } > } > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > + unsigned long size, int nid) why not use static ? it seems there is not outside user. > +{ > + unsigned long validate_end_pfn = pfn + size; > + > + if (pfn & (size - 1)) > + return 1; > + > + if (pfn + size >= end_pfn) > + return 1; > + > + while (pfn < validate_end_pfn) > + { > + if (!early_pfn_valid(pfn)) > + return 1; > + if (!early_pfn_in_nid(pfn, nid)) > + return 1; > + pfn++; > + } > + > + return size; > +} > + > /* > * Initially all pages are reserved - free ones are freed > * up by free_all_bootmem() once the early boot process is > @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > highest_memmap_pfn = end_pfn - 1; > > z = &NODE_DATA(nid)->node_zones[zone]; > - for (pfn = start_pfn; pfn < end_pfn; pfn++) { > + for (pfn = start_pfn; pfn < end_pfn; ) { > /* > * There can be holes in boot-time mem_map[]s > * handed to this function. They do not > * exist on hotplugged memory. > */ > + int pfns = 1; > if (context == MEMMAP_EARLY) { > - if (!early_pfn_valid(pfn)) > + if (!early_pfn_valid(pfn)) { > + pfn++; > continue; > - if (!early_pfn_in_nid(pfn, nid)) > + } > + if (!early_pfn_in_nid(pfn, nid)) { > + pfn++; > continue; > + } > + > + pfns = pfn_range_init_avail(pfn, end_pfn, > + PTRS_PER_PMD, nid); > } maybe could update memmap_init_zone() only iterate {memblock.memory} - {memblock.reserved}, so you do need to check avail .... 
as memmap_init_zone do not need to handle holes to mark reserve for them. > + > page = pfn_to_page(pfn); > __init_single_page(page, zone, nid, 1); > + > + if (pfns > 1) > + SetPageUninitialized2Mib(page); > + > + pfn += pfns; > } > } > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > {1UL << PG_owner_priv_1, "owner_priv_1" }, > {1UL << PG_arch_1, "arch_1" }, > {1UL << PG_reserved, "reserved" }, > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, PG_uninitialized_2m ? > {1UL << PG_private, "private" }, > {1UL << PG_private_2, "private_2" }, > {1UL << PG_writeback, "writeback" }, Yinghai -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx116.postini.com [74.125.245.116]) by kanga.kvack.org (Postfix) with SMTP id AAA516B0031 for ; Sat, 13 Jul 2013 00:39:50 -0400 (EDT) Message-ID: <51E0DA05.4090107@zytor.com> Date: Fri, 12 Jul 2013 21:39:33 -0700 From: "H. Peter Anvin" MIME-Version: 1.0 Subject: Re: [RFC 4/4] Sparse initialization of struct page array. References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Yinghai Lu Cc: Robin Holt , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On 07/12/2013 09:19 PM, Yinghai Lu wrote: >> PG_reserved, >> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ >> PG_private, /* If pagecache, has fs-private data */ The comment here is WTF... -hpa -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx146.postini.com [74.125.245.146]) by kanga.kvack.org (Postfix) with SMTP id 0B14E6B0031 for ; Sat, 13 Jul 2013 01:31:16 -0400 (EDT)
Received: by mail-ie0-f169.google.com with SMTP id 10so22850766ied.0 for ; Fri, 12 Jul 2013 22:31:16 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <51E0DA05.4090107@zytor.com>
References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <51E0DA05.4090107@zytor.com>
Date: Fri, 12 Jul 2013 22:31:16 -0700
Message-ID:
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.
From: Yinghai Lu
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
List-ID:
To: "H. Peter Anvin"
Cc: Robin Holt , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman

On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin wrote:
> On 07/12/2013 09:19 PM, Yinghai Lu wrote:
>>> PG_reserved,
>>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */
>>> PG_private, /* If pagecache, has fs-private data */
>
> The comment here is WTF...

ntz: Nate Zimmer?
rmh: Robin Holt?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx178.postini.com [74.125.245.178]) by kanga.kvack.org (Postfix) with SMTP id 23C2A6B0031 for ; Sat, 13 Jul 2013 01:39:12 -0400 (EDT)
Message-ID: <51E0E7ED.7040801@zytor.com>
Date: Fri, 12 Jul 2013 22:38:53 -0700
From: "H. Peter Anvin"
MIME-Version: 1.0
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.
References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <51E0DA05.4090107@zytor.com>
In-Reply-To:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Yinghai Lu
Cc: Robin Holt , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman

On 07/12/2013 10:31 PM, Yinghai Lu wrote:
> On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin wrote:
>> On 07/12/2013 09:19 PM, Yinghai Lu wrote:
>>>> PG_reserved,
>>>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */
>>>> PG_private, /* If pagecache, has fs-private data */
>>
>> The comment here is WTF...
>
> ntz: Nate Zimmer?
> rmh: Robin Holt?
>

This kind of conversation doesn't really belong in a code comment, though.

-hpa

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx125.postini.com [74.125.245.125]) by kanga.kvack.org (Postfix) with SMTP id 22F526B0089 for ; Sun, 14 Jul 2013 21:38:52 -0400 (EDT)
Received: by mail-ie0-f170.google.com with SMTP id e11so25031448iej.15 for ; Sun, 14 Jul 2013 18:38:51 -0700 (PDT)
Message-ID: <51E3529F.6070909@gmail.com>
Date: Mon, 15 Jul 2013 09:38:39 +0800
From: Sam Ben
MIME-Version: 1.0
Subject: Re: boot tracing
References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> <20130712084712.GD24008@pd.tnic> <20130712085341.GC4328@gmail.com>
In-Reply-To: <20130712085341.GC4328@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ingo Molnar
Cc: Borislav Petkov , Robin Holt , Robert Richter , "H.
Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra On 07/12/2013 04:53 PM, Ingo Molnar wrote: > * Borislav Petkov wrote: > >> On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: >>> Robert Richter and Boris Petkov are working on 'persistent events' >>> support for perf, which will eventually allow boot time profiling - >>> I'm not sure if the patches and the tooling support is ready enough >>> yet for your purposes. >> Nope, not yet but we're getting there. >> >>> Robert, Boris, the following workflow would be pretty intuitive: >>> >>> - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB >> What does perf=boot mean? I assume boot tracing. > In this case it would mean boot profiling - i.e. a cycles hardware-PMU > event collecting into a perf trace buffer as usual. > > Essentially a 'perf record -a' work-alike, just one that gets activated as > early as practical, and which would allow the profiling of memory > initialization. > > Now, one extra complication here is that to be able to profile buddy > allocator this persistent event would have to work before the buddy > allocator is active :-/ So this sort of profiling would have to use > memblock_alloc(). Could perf=boot be used to sample the performance of memblock subsystem? I think the perf subsystem is too late to be initialized and monitor this. > > Just wanted to highlight this usecase, we might eventually want to support > it. > > [ Note that this is different from boot tracing of one or more trace > events - but it's a conceptually pretty close cousin. 
] > > Thanks, > > Ingo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx174.postini.com [74.125.245.174]) by kanga.kvack.org (Postfix) with SMTP id 0D00F6B0098 for ; Sun, 14 Jul 2013 23:19:35 -0400 (EDT) Date: Sun, 14 Jul 2013 22:19:33 -0500 From: Robin Holt Subject: Re: [RFC 3/4] Seperate page initialization into a separate function. Message-ID: <20130715031932.GA31581@gulag1.americas.sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-4-git-send-email-holt@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Yinghai Lu Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Fri, Jul 12, 2013 at 08:06:52PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote: > > Currently, memmap_init_zone() has all the smarts for initializing a > > single page. When we convert to initializing pages in a 2MiB chunk, > > we will need to do this equivalent work from two separate places > > so we are breaking out a helper function. > > > > Signed-off-by: Robin Holt > > Signed-off-by: Nate Zimmer > > To: "H. 
Peter Anvin" > > To: Ingo Molnar > > Cc: Linux Kernel > > Cc: Linux MM > > Cc: Rob Landley > > Cc: Mike Travis > > Cc: Daniel J Blueman > > Cc: Andrew Morton > > Cc: Greg KH > > Cc: Yinghai Lu > > Cc: Mel Gorman > > --- > > mm/mm_init.c | 2 +- > > mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------ > > 2 files changed, 45 insertions(+), 32 deletions(-) > > > > diff --git a/mm/mm_init.c b/mm/mm_init.c > > index c280a02..be8a539 100644 > > --- a/mm/mm_init.c > > +++ b/mm/mm_init.c > > @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) > > BUG_ON(or_mask != add_mask); > > } > > > > -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, > > +void mminit_verify_page_links(struct page *page, enum zone_type zone, > > unsigned long nid, unsigned long pfn) > > { > > BUG_ON(page_to_nid(page) != nid); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index c3edb62..635b131 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > > spin_unlock(&zone->lock); > > } > > > > +static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved) > > +{ > > + unsigned long pfn = page_to_pfn(page); > > + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > > + > > + set_page_links(page, zone, nid, pfn); > > + mminit_verify_page_links(page, zone, nid, pfn); > > + init_page_count(page); > > + page_mapcount_reset(page); > > + page_nid_reset_last(page); > > + if (reserved) { > > + SetPageReserved(page); > > + } else { > > + ClearPageReserved(page); > > + set_page_count(page, 0); > > + } > > + /* > > + * Mark the block movable so that blocks are reserved for > > + * movable at startup. This will force kernel allocations > > + * to reserve their blocks rather than leaking throughout > > + * the address space during boot when many long-lived > > + * kernel allocations are made. 
Later some blocks near > > + * the start are marked MIGRATE_RESERVE by > > + * setup_zone_migrate_reserve() > > + * > > + * bitmap is created for zone's valid pfn range. but memmap > > + * can be created for invalid pages (for alignment) > > + * check here not to call set_pageblock_migratetype() against > > + * pfn out of zone. > > + */ > > + if ((z->zone_start_pfn <= pfn) > > + && (pfn < zone_end_pfn(z)) > > + && !(pfn & (pageblock_nr_pages - 1))) > > + set_pageblock_migratetype(page, MIGRATE_MOVABLE); > > + > > + INIT_LIST_HEAD(&page->lru); > > +#ifdef WANT_PAGE_VIRTUAL > > + /* The shift won't overflow because ZONE_NORMAL is below 4G. */ > > + if (!is_highmem_idx(zone)) > > + set_page_address(page, __va(pfn << PAGE_SHIFT)); > > +#endif > > +} > > + > > static bool free_pages_prepare(struct page *page, unsigned int order) > > { > > int i; > > @@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > > continue; > > } > > page = pfn_to_page(pfn); > > - set_page_links(page, zone, nid, pfn); > > - mminit_verify_page_links(page, zone, nid, pfn); > > - init_page_count(page); > > - page_mapcount_reset(page); > > - page_nid_reset_last(page); > > - SetPageReserved(page); > > - /* > > - * Mark the block movable so that blocks are reserved for > > - * movable at startup. This will force kernel allocations > > - * to reserve their blocks rather than leaking throughout > > - * the address space during boot when many long-lived > > - * kernel allocations are made. Later some blocks near > > - * the start are marked MIGRATE_RESERVE by > > - * setup_zone_migrate_reserve() > > - * > > - * bitmap is created for zone's valid pfn range. but memmap > > - * can be created for invalid pages (for alignment) > > - * check here not to call set_pageblock_migratetype() against > > - * pfn out of zone. 
> > - */ > > - if ((z->zone_start_pfn <= pfn) > > - && (pfn < zone_end_pfn(z)) > > - && !(pfn & (pageblock_nr_pages - 1))) > > - set_pageblock_migratetype(page, MIGRATE_MOVABLE); > > - > > - INIT_LIST_HEAD(&page->lru); > > -#ifdef WANT_PAGE_VIRTUAL > > - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ > > - if (!is_highmem_idx(zone)) > > - set_page_address(page, __va(pfn << PAGE_SHIFT)); > > -#endif > > + __init_single_page(page, zone, nid, 1); > > Can you > move page = pfn_to_page(pfn) into __init_single_page > and pass pfn directly? Sure, but then I don't care for the name so much, but I can think on that some too. I think the feedback I was most hoping to receive was pertaining to a means for removing the PG_uninitialized2Mib flag entirely. If I could get rid of that and have some page-local way of knowing if it has not been initialized, I think this patch set would be much better. Robin -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx156.postini.com [74.125.245.156]) by kanga.kvack.org (Postfix) with SMTP id ACD206B0039 for ; Mon, 15 Jul 2013 10:08:52 -0400 (EDT) Message-ID: <51E40272.6000806@sgi.com> Date: Mon, 15 Jul 2013 09:08:50 -0500 From: Nathan Zimmer MIME-Version: 1.0 Subject: Re: [RFC 4/4] Sparse initialization of struct page array. References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <51E0DA05.4090107@zytor.com> In-Reply-To: Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Yinghai Lu Cc: "H. 
Peter Anvin" , Robin Holt , Ingo Molnar , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On 07/13/2013 12:31 AM, Yinghai Lu wrote: > On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin wrote: >> On 07/12/2013 09:19 PM, Yinghai Lu wrote: >>>> PG_reserved, >>>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ >>>> PG_private, /* If pagecache, has fs-private data */ >> The comment here is WTF... > ntz: Nate Zimmer? > rmh: Robin Holt? Yea that comment was supposed to be removed. Sorry about that. Nate -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx193.postini.com [74.125.245.193]) by kanga.kvack.org (Postfix) with SMTP id A5B686B0031 for ; Mon, 15 Jul 2013 11:00:41 -0400 (EDT) Date: Mon, 15 Jul 2013 10:00:40 -0500 From: Robin Holt Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130715150040.GA3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: "H. Peter Anvin" , Ingo Molnar Cc: Robin Holt , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman On Thu, Jul 11, 2013 at 09:03:51PM -0500, Robin Holt wrote: > We have been working on this since we returned from shutdown and have > something to discuss now. We restricted ourselves to 2MiB initialization > to keep the patch set a little smaller and more clear. > > First, I think I want to propose getting rid of the page flag. 
If I knew > of a concrete way to determine that the page has not been initialized, > this patch series would look different. If there is no definitive > way to determine that the struct page has been initialized aside from > checking the entire page struct is zero, then I think I would suggest > we change the page flag to indicate the page has been initialized. Ingo or HPA, Did I implement this wrong or is there a way to get rid of the page flag which is not going to impact normal operation? I don't want to put too much more effort into this until I know we are stuck going this direction. Currently, the expand() function has a relatively expensive check against the 2MiB aligned pfn's struct page. I do not know of a way to eliminate that check against the other page as the first reference we see for a page is in the middle of that 2MiB aligned range. To identify this as an area of concern, we had booted with a simulator, setting watch points on the struct page array region once the Uninitialized flag was set and maintaining that until it was cleared. Thanks, Robin
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx192.postini.com [74.125.245.192]) by kanga.kvack.org (Postfix) with SMTP id 57A5C6B0034 for ; Mon, 15 Jul 2013 11:16:25 -0400 (EDT) Date: Mon, 15 Jul 2013 10:16:23 -0500 From: Robin Holt Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130715151623.GB3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130712082756.GA4328@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Robin Holt , Borislav Petkov , Robert Richter , "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > > * Robin Holt wrote: > > > [...] > > > > With this patch, we did boot a 16TiB machine. Without the patches, the > > v3.10 kernel with the same configuration took 407 seconds for > > free_all_bootmem. With the patches and operating on 2MiB pages instead > > of 1GiB, it took 26 seconds so performance was improved. I have no feel > > for how the 1GiB chunk size will perform. > > That's pretty impressive. And WRONG! That is a 15x speedup in the freeing of memory at the free_all_bootmem point. That is _NOT_ the speedup from memmap_init_zone. I forgot to take that into account as Nate pointed out this morning in a hallway discussion. Before, on the 16TiB machine, memmap_init_zone took 1152 seconds. After, it took 50. If it were a straight 1/512th, we would have expected that 1152 to be something more along the lines of 2-3 seconds, so there is still significant room for improvement. Sorry for the confusion.
> It's still a 15x speedup instead of a 512x speedup, so I'd say there's > something else being the current bottleneck, besides page init > granularity. > > Can you boot with just a few gigs of RAM and stuff the rest into hotplug > memory, and then hot-add that memory? That would allow easy profiling of > remaining overhead. Nate and I will be working on other things for the next few hours hoping there is a better answer to the first question we asked about there being a way to test a page other than comparing against all zeroes to see if it has been initialized. Thanks, Robin From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id DCDA36B0032 for ; Mon, 15 Jul 2013 13:45:57 -0400 (EDT) Date: Mon, 15 Jul 2013 12:45:55 -0500 From: Nathan Zimmer Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130715174551.GA58640@asylum.americas.sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Yinghai Lu Cc: Robin Holt , "H.
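The figures quoted above can be checked with a few lines of C. This is only an illustrative sketch and is not part of the patch set: the helper names `speedup` and `ideal_seconds` and the macro `PAGES_PER_2MIB` are made up here, and a 4KiB base page size is assumed, which is what makes a 2MiB chunk cover 512 struct pages.

```c
#include <assert.h>

/*
 * Assumption: 4KiB base pages, so one 2MiB chunk covers 512 struct pages.
 * This is the source of the "512x" figure in the thread.
 */
#define PAGES_PER_2MIB (2UL * 1024 * 1024 / 4096)

/* Observed speedup: elapsed seconds before the change over seconds after. */
static double speedup(double before_s, double after_s)
{
	assert(after_s > 0.0);
	return before_s / after_s;
}

/*
 * Time one would expect if the work shrank by exactly the chunk ratio,
 * i.e. the "straight 1/512th" case.
 */
static double ideal_seconds(double before_s, unsigned long pages_per_chunk)
{
	assert(pages_per_chunk > 0);
	return before_s / (double)pages_per_chunk;
}
```

With the numbers from this thread, `speedup(407.0, 26.0)` is about 15.7 (free_all_bootmem), `speedup(1152.0, 50.0)` is about 23 (memmap_init_zone), and `ideal_seconds(1152.0, PAGES_PER_2MIB)` is 2.25 seconds, which is the "2-3" range mentioned above.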
Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote: > > + > > page = pfn_to_page(pfn); > > __init_single_page(page, zone, nid, 1); > > + > > + if (pfns > 1) > > + SetPageUninitialized2Mib(page); > > + > > + pfn += pfns; > > } > > } > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > PG_uninitialized_2m ? > > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, > > Yinghai I hadn't actually been very happy with having a PG_uninitialized2mib flag. It implies if we want to jump to 1Gb pages we would need a second flag, PG_uninitialized1gb, for that. I was thinking of changing it to PG_uninitialized and setting page->private to the correct order. Thoughts? Nate -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx133.postini.com [74.125.245.133]) by kanga.kvack.org (Postfix) with SMTP id 8BD7C6B0033 for ; Mon, 15 Jul 2013 13:54:46 -0400 (EDT) Message-ID: <51E4375E.1010704@zytor.com> Date: Mon, 15 Jul 2013 10:54:38 -0700 From: "H. Peter Anvin" MIME-Version: 1.0 Subject: Re: [RFC 4/4] Sparse initialization of struct page array. 
References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> In-Reply-To: <20130715174551.GA58640@asylum.americas.sgi.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Nathan Zimmer Cc: Yinghai Lu , Robin Holt , Ingo Molnar , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On 07/15/2013 10:45 AM, Nathan Zimmer wrote: > > I hadn't actually been very happy with having a PG_uninitialized2mib flag. > It implies if we want to jump to 1Gb pages we would need a second flag, > PG_uninitialized1gb, for that. I was thinking of changing it to > PG_uninitialized and setting page->private to the correct order. > Thoughts? > Seems straightforward. The bigger issue is the amount of overhead we cause by having to check upstack for the initialization status of the superpages. I'm concerned, obviously, about lingering overhead that is "forever". That being said, in the absolutely worst case we could have a counter to the number of uninitialized pages which when it hits zero we do a static switch and switch out the initialization code (would have to be undone on memory hotplug, of course.) -hpa -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id AA8E56B0032 for ; Mon, 15 Jul 2013 14:26:17 -0400 (EDT) Date: Mon, 15 Jul 2013 13:26:15 -0500 From: Robin Holt Subject: Re: [RFC 4/4] Sparse initialization of struct page array. 
Message-ID: <20130715182615.GF3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51E4375E.1010704@zytor.com> Sender: owner-linux-mm@kvack.org List-ID: To: "H. Peter Anvin" Cc: Nathan Zimmer , Yinghai Lu , Robin Holt , Ingo Molnar , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Mon, Jul 15, 2013 at 10:54:38AM -0700, H. Peter Anvin wrote: > On 07/15/2013 10:45 AM, Nathan Zimmer wrote: > > > > I hadn't actually been very happy with having a PG_uninitialized2mib flag. > > It implies if we want to jump to 1Gb pages we would need a second flag, > > PG_uninitialized1gb, for that. I was thinking of changing it to > > PG_uninitialized and setting page->private to the correct order. > > Thoughts? > > > > Seems straightforward. The bigger issue is the amount of overhead we > cause by having to check upstack for the initialization status of the > superpages. > > I'm concerned, obviously, about lingering overhead that is "forever". > That being said, in the absolutely worst case we could have a counter to > the number of uninitialized pages which when it hits zero we do a static > switch and switch out the initialization code (would have to be undone > on memory hotplug, of course.) Is there a fairly cheap way to determine definitively that the struct page is not initialized? I think this patch set can change fairly drastically if we have that. I think I will start working up those changes and code a heavy-handed check until I hear of an alternative way to cheaply check. Thanks, Robin -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx178.postini.com [74.125.245.178]) by kanga.kvack.org (Postfix) with SMTP id 56AEA6B0032 for ; Mon, 15 Jul 2013 14:29:45 -0400 (EDT) Message-ID: <51E43F91.1040906@zytor.com> Date: Mon, 15 Jul 2013 11:29:37 -0700 From: "H. Peter Anvin" MIME-Version: 1.0 Subject: Re: [RFC 4/4] Sparse initialization of struct page array. References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> In-Reply-To: <20130715182615.GF3421@sgi.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Robin Holt Cc: Nathan Zimmer , Yinghai Lu , Ingo Molnar , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On 07/15/2013 11:26 AM, Robin Holt wrote: > Is there a fairly cheap way to determine definitively that the struct > page is not initialized? By definition I would assume no. The only way I can think of would be to unmap the memory associated with the struct page in the TLB and initialize the struct pages at trap time. > I think this patch set can change fairly drastically if we have that. > I think I will start working up those changes and code a heavy-handed > check until I hear of an alternative way to cheaply check. -hpa -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx127.postini.com [74.125.245.127]) by kanga.kvack.org (Postfix) with SMTP id 922B06B0031 for ; Mon, 15 Jul 2013 17:30:39 -0400 (EDT) Date: Mon, 15 Jul 2013 14:30:37 -0700 From: Andrew Morton Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-Id: <20130715143037.8287ffbf2fb0e72bc8efb287@linux-foundation.org> In-Reply-To: <1373594635-131067-5-git-send-email-holt@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Robin Holt Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Greg KH , Yinghai Lu , Mel Gorman On Thu, 11 Jul 2013 21:03:55 -0500 Robin Holt wrote: > During boot of large memory machines, a significant portion of boot > is spent initializing the struct page array. The vast majority of > those pages are not referenced during boot. > > Change this over to only initializing the pages when they are > actually allocated. > > Besides the advantage of boot speed, this allows us the chance to > use normal performance monitoring tools to determine where the bulk > of time is spent during page initialization. > > ... > > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > __free_page(page); > } > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > + > +static inline void __reserve_bootmem_page(struct page *page) > +{ > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > + phys_addr_t end = start + PAGE_SIZE; > + > + __reserve_bootmem_region(start, end); > +} It isn't obvious that this needed to be inlined? 
> static inline void free_reserved_page(struct page *page) > { > + __reserve_bootmem_page(page); > __free_reserved_page(page); > adjust_managed_page_count(page, 1); > } > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 6d53675..79e8eb7 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -83,6 +83,7 @@ enum pageflags { > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > PG_arch_1, > PG_reserved, > + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ "mib" creeps me out too. And it makes me think of SNMP, which I'd prefer not to think about. We've traditionally had fears of running out of page flags, but I've lost track of how close we are to that happening. IIRC the answer depends on whether you believe there is such a thing as a 32-bit NUMA system. Can this be avoided anyway? I suspect there's some idiotic combination of flags we could use to indicate the state. PG_reserved|PG_lru or something. "2MB" sounds terribly arch-specific. Shouldn't we make it more generic for when the hexagon64 port wants to use 4MB? That conversational code comment was already commented on, but it's still there? > > ... > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > #endif > } > > +static void expand_page_initialization(struct page *basepage) > +{ > + unsigned long pfn = page_to_pfn(basepage); > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > + unsigned long zone = page_zonenum(basepage); > + int reserved = PageReserved(basepage); > + int nid = page_to_nid(basepage); > + > + ClearPageUninitialized2Mib(basepage); > + > + for( pfn++; pfn < end_pfn; pfn++ ) > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > +} > + > +void ensure_pages_are_initialized(unsigned long start_pfn, > + unsigned long end_pfn) I think this can be made static. 
I hope so, as it's a somewhat odd-sounding identifier for a global. > +{ > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > + unsigned long aligned_end_pfn; > + struct page *page; > + > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > + aligned_end_pfn += PTRS_PER_PMD; > + while (aligned_start_pfn < aligned_end_pfn) { > + if (pfn_valid(aligned_start_pfn)) { > + page = pfn_to_page(aligned_start_pfn); > + > + if(PageUninitialized2Mib(page)) checkpatch them, please. > + expand_page_initialization(page); > + } > + > + aligned_start_pfn += PTRS_PER_PMD; > + } > +} Some nice code comments for the above two functions would be helpful. > > ... > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > + unsigned long size, int nid) > +{ > + unsigned long validate_end_pfn = pfn + size; > + > + if (pfn & (size - 1)) > + return 1; > + > + if (pfn + size >= end_pfn) > + return 1; > + > + while (pfn < validate_end_pfn) > + { > + if (!early_pfn_valid(pfn)) > + return 1; > + if (!early_pfn_in_nid(pfn, nid)) > + return 1; > + pfn++; > + } > + > + return size; > +} Document it, please. The return value semantics look odd, so don't forget to explain all that as well. > > ... > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > {1UL << PG_owner_priv_1, "owner_priv_1" }, > {1UL << PG_arch_1, "arch_1" }, > {1UL << PG_reserved, "reserved" }, > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, It would be better if the name which is visible in procfs matches the name in the kernel source code. > {1UL << PG_private, "private" }, > {1UL << PG_private_2, "private_2" }, > {1UL << PG_writeback, "writeback" }, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx112.postini.com [74.125.245.112]) by kanga.kvack.org (Postfix) with SMTP id 4E5F16B0032 for ; Tue, 16 Jul 2013 04:55:00 -0400 (EDT) Date: Tue, 16 Jul 2013 17:55:02 +0900 From: Joonsoo Kim Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130716085502.GA31276@lge.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130712082756.GA4328@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Robin Holt , Borislav Petkov , Robert Richter , "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > > * Robin Holt wrote: > > > [...] > > > > With this patch, we did boot a 16TiB machine. Without the patches, the > > v3.10 kernel with the same configuration took 407 seconds for > > free_all_bootmem. With the patches and operating on 2MiB pages instead > > of 1GiB, it took 26 seconds so performance was improved. I have no feel > > for how the 1GiB chunk size will perform. > > That's pretty impressive. > > It's still a 15x speedup instead of a 512x speedup, so I'd say there's > something else being the current bottleneck, besides page init > granularity. > > Can you boot with just a few gigs of RAM and stuff the rest into hotplug > memory, and then hot-add that memory? That would allow easy profiling of > remaining overhead.
> > Side note: > > Robert Richter and Boris Petkov are working on 'persistent events' support > for perf, which will eventually allow boot time profiling - I'm not sure > if the patches and the tooling support is ready enough yet for your > purposes. > > Robert, Boris, the following workflow would be pretty intuitive: > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > > - we'd get a single (cycles?) event running once the perf subsystem is up > and running, with a sampling frequency of 1 KHz, sending profiling > trace events to a sufficiently sized profiling buffer of 16 MB per > CPU. > > - once the system reaches SYSTEM_RUNNING, profiling is stopped either > automatically - or the user stops it via a new tooling command. > > - the profiling buffer is extracted into a regular perf.data via a > special 'perf record' call or some other, new perf tooling > solution/variant. > > [ Alternatively the kernel could attempt to construct a 'virtual' > perf.data from the persistent buffer, available via /sys/debug or > elsewhere in /sys - just like the kernel constructs a 'virtual' > /proc/kcore, etc. That file could be copied or used directly. ] Hello, Robert, Boris, Ingo. How about executing a perf in usermodehelper and collecting output in tmpfs? Using this approach, we can start a perf after rootfs initialization, because we need a perf binary at least. But we can use almost all of perf's functionality. If anyone is interested in this approach, I will send patches implementing this idea. Thanks.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx161.postini.com [74.125.245.161]) by kanga.kvack.org (Postfix) with SMTP id 29B5B6B0032 for ; Tue, 16 Jul 2013 05:08:31 -0400 (EDT) Date: Tue, 16 Jul 2013 11:08:05 +0200 From: Borislav Petkov Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130716090805.GC4402@pd.tnic> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> <20130716085502.GA31276@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20130716085502.GA31276@lge.com> Sender: owner-linux-mm@kvack.org List-ID: To: Joonsoo Kim Cc: Ingo Molnar , Robin Holt , Robert Richter , "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra On Tue, Jul 16, 2013 at 05:55:02PM +0900, Joonsoo Kim wrote: > How about executing a perf in usermodehelper and collecting output > in tmpfs? Using this approach, we can start a perf after rootfs > initialization, What for if we can start logging to buffers much earlier? *Reading* from those buffers can be done much later, at our own leisure with full userspace up. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx109.postini.com [74.125.245.109]) by kanga.kvack.org (Postfix) with SMTP id D19816B0032 for ; Tue, 16 Jul 2013 06:26:19 -0400 (EDT) Date: Tue, 16 Jul 2013 05:26:15 -0500 From: Robin Holt Subject: Re: [RFC 4/4] Sparse initialization of struct page array.
Message-ID: <20130716102615.GG3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Yinghai Lu Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote: > > During boot of large memory machines, a significant portion of boot > > is spent initializing the struct page array. The vast majority of > > those pages are not referenced during boot. > > > > Change this over to only initializing the pages when they are > > actually allocated. > > > > Besides the advantage of boot speed, this allows us the chance to > > use normal performance monitoring tools to determine where the bulk > > of time is spent during page initialization. > > > > Signed-off-by: Robin Holt > > Signed-off-by: Nate Zimmer > > To: "H. 
Peter Anvin" > > To: Ingo Molnar > > Cc: Linux Kernel > > Cc: Linux MM > > Cc: Rob Landley > > Cc: Mike Travis > > Cc: Daniel J Blueman > > Cc: Andrew Morton > > Cc: Greg KH > > Cc: Yinghai Lu > > Cc: Mel Gorman > > --- > > include/linux/mm.h | 11 +++++ > > include/linux/page-flags.h | 5 +- > > mm/nobootmem.c | 5 ++ > > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- > > 4 files changed, 132 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index e0c8528..3de08b5 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > > __free_page(page); > > } > > > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > + > > +static inline void __reserve_bootmem_page(struct page *page) > > +{ > > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > > + phys_addr_t end = start + PAGE_SIZE; > > + > > + __reserve_bootmem_region(start, end); > > +} > > + > > static inline void free_reserved_page(struct page *page) > > { > > + __reserve_bootmem_page(page); > > __free_reserved_page(page); > > adjust_managed_page_count(page, 1); > > } > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 6d53675..79e8eb7 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -83,6 +83,7 @@ enum pageflags { > > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > > PG_arch_1, > > PG_reserved, > > + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ > > PG_private, /* If pagecache, has fs-private data */ > > PG_private_2, /* If pagecache, has fs aux data */ > > PG_writeback, /* Page is under writeback */ > > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) > > > > __PAGEFLAG(SlobFree, slob_free) > > > > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) > > + > > /* > > * Private page markings that may be used by the filesystem that owns the page > > * for its own purposes. > > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) > > #define PAGE_FLAGS_CHECK_AT_FREE \ > > (1 << PG_lru | 1 << PG_locked | \ > > 1 << PG_private | 1 << PG_private_2 | \ > > - 1 << PG_writeback | 1 << PG_reserved | \ > > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ > > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ > > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ > > __PG_COMPOUND_LOCK) > > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > > index 3b512ca..e3a386d 100644 > > --- a/mm/nobootmem.c > > +++ b/mm/nobootmem.c > > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) > > phys_addr_t start, end, size; > > u64 i; > > > > + for_each_reserved_mem_region(i, &start, &end) > > + __reserve_bootmem_region(start, end); > > + > > How about holes that is not in memblock.reserved? > > before this patch: > free_area_init_node/free_area_init_core/memmap_init_zone > will mark all page in node range to Reserved in struct page, at first. > > but those holes is not mapped via kernel low mapping. > so it should be ok not touch "struct page" for them. > > Now you only mark reserved for memblock.reserved at first, and later > mark {memblock.memory} - { memblock.reserved} to be available. > And that is ok. > > but should split that change to another patch and add some comment > and change log for the change. > in case there is some user like UEFI etc do weird thing. 
I will split out a separate patch for this. > > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > > count += __free_memory_core(start, end); > > > > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) > > { > > struct pglist_data *pgdat; > > > > + memblock_dump_all(); > > + > > Not needed. Left over debug in the rush to ask our question. > > for_each_online_pgdat(pgdat) > > reset_node_lowmem_managed_pages(pgdat); > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 635b131..fe51eb9 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > > #endif > > } > > > > +static void expand_page_initialization(struct page *basepage) > > +{ > > + unsigned long pfn = page_to_pfn(basepage); > > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > > + unsigned long zone = page_zonenum(basepage); > > + int reserved = PageReserved(basepage); > > + int nid = page_to_nid(basepage); > > + > > + ClearPageUninitialized2Mib(basepage); > > + > > + for( pfn++; pfn < end_pfn; pfn++ ) > > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > > +} > > + > > +void ensure_pages_are_initialized(unsigned long start_pfn, > > + unsigned long end_pfn) > > +{ > > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > > + unsigned long aligned_end_pfn; > > + struct page *page; > > + > > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > > + aligned_end_pfn += PTRS_PER_PMD; > > + while (aligned_start_pfn < aligned_end_pfn) { > > + if (pfn_valid(aligned_start_pfn)) { > > + page = pfn_to_page(aligned_start_pfn); > > + > > + if(PageUninitialized2Mib(page)) > > + expand_page_initialization(page); > > + } > > + > > + aligned_start_pfn += PTRS_PER_PMD; > > + } > > +} > > + > > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > > +{ > > + unsigned long start_pfn = PFN_DOWN(start); > > + unsigned long end_pfn = PFN_UP(end); > 
> + > > + ensure_pages_are_initialized(start_pfn, end_pfn); > > +} > > that name is confusing, actually it is setting to struct page to Reserved only. > maybe __reserve_pages_bootmem() to be aligned to free_pages_bootmem ? Done. > > + > > +static inline void ensure_page_is_initialized(struct page *page) > > +{ > > + __reserve_bootmem_page(page); > > +} > > how about use __reserve_page_bootmem directly and add comment in callers site ? I really dislike that. The inline function makes the need for a comment unnecessary in my opinion and leaves the implementation localized to the one-line function. Those wanting to understand why can quickly see that the intended functionality is accomplished by the other function. I would really like to leave this as-is. > > + > > static bool free_pages_prepare(struct page *page, unsigned int order) > > { > > int i; > > @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) > > if (PageAnon(page)) > > page->mapping = NULL; > > for (i = 0; i < (1 << order); i++) > > - bad += free_pages_check(page + i); > > + if (PageUninitialized2Mib(page + i)) > > + i += PTRS_PER_PMD - 1; > > + else > > + bad += free_pages_check(page + i); > > if (bad) > > return false; > > > > @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) > > unsigned int loop; > > > > prefetchw(page); > > - for (loop = 0; loop < nr_pages; loop++) { > > + for (loop = 0; loop < nr_pages; ) { > > struct page *p = &page[loop]; > > > > if (loop + 1 < nr_pages) > > prefetchw(p + 1); > > + > > + if ((PageUninitialized2Mib(p)) && > > + ((loop + PTRS_PER_PMD) > nr_pages)) > > + ensure_page_is_initialized(p); > > + > > __ClearPageReserved(p); > > set_page_count(p, 0); > > + if (PageUninitialized2Mib(p)) > > + loop += PTRS_PER_PMD; > > + else > > + loop += 1; > > } > > > > page_zone(page)->managed_pages += 1 << order; > > @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page, > 
> area--; > > high--; > > size >>= 1; > > + ensure_page_is_initialized(page); > > VM_BUG_ON(bad_range(zone, &page[size])); > > > > #ifdef CONFIG_DEBUG_PAGEALLOC > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > for (i = 0; i < (1 << order); i++) { > > struct page *p = page + i; > > + > > + if (PageUninitialized2Mib(p)) > > + expand_page_initialization(page); > > + > > if (unlikely(check_new_page(p))) > > return 1; > > } > > @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone, > > unsigned long order; > > int pages_moved = 0; > > > > + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page)); > > #ifndef CONFIG_HOLES_IN_ZONE > > /* > > * page_zone is not safe to call in this context when > > @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) > > for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) > > return 1; > > + > > + if (PageUninitialized2Mib(pfn_to_page(pfn))) > > + pfn += PTRS_PER_PMD; > > } > > return 0; > > } > > @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone) > > } > > } > > > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > > + unsigned long size, int nid) > why not use static ? it seems there is not outside user. Left over from early patch series where we were using this from mm/nobootmem.c. Fixed. 
> > +{ > > + unsigned long validate_end_pfn = pfn + size; > > + > > + if (pfn & (size - 1)) > > + return 1; > > + > > + if (pfn + size >= end_pfn) > > + return 1; > > + > > + while (pfn < validate_end_pfn) > > + { > > + if (!early_pfn_valid(pfn)) > > + return 1; > > + if (!early_pfn_in_nid(pfn, nid)) > > + return 1; > > + pfn++; > > + } > > + > > + return size; > > +} > > + > > /* > > * Initially all pages are reserved - free ones are freed > > * up by free_all_bootmem() once the early boot process is > > @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > > highest_memmap_pfn = end_pfn - 1; > > > > z = &NODE_DATA(nid)->node_zones[zone]; > > - for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > + for (pfn = start_pfn; pfn < end_pfn; ) { > > /* > > * There can be holes in boot-time mem_map[]s > > * handed to this function. They do not > > * exist on hotplugged memory. > > */ > > + int pfns = 1; > > if (context == MEMMAP_EARLY) { > > - if (!early_pfn_valid(pfn)) > > + if (!early_pfn_valid(pfn)) { > > + pfn++; > > continue; > > - if (!early_pfn_in_nid(pfn, nid)) > > + } > > + if (!early_pfn_in_nid(pfn, nid)) { > > + pfn++; > > continue; > > + } > > + > > + pfns = pfn_range_init_avail(pfn, end_pfn, > > + PTRS_PER_PMD, nid); > > } > > maybe could update memmap_init_zone() only iterate {memblock.memory} - > {memblock.reserved}, so you do need to check avail .... > > as memmap_init_zone do not need to handle holes to mark reserve for them. Maybe I can change pfn_range_init_avail in such a way that the __reserve_pages_bootmem() work above is not needed. I will dig into that some more before the next patch submission. 
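The return convention of the quoted pfn_range_init_avail() and the way the patched memmap_init_zone() loop consumes it can be modeled in user space. The following is only a minimal sketch under stated assumptions, not the kernel code: toy_pfn_valid() and toy_pfn_in_nid() are hypothetical stand-ins for early_pfn_valid() and early_pfn_in_nid(), the memory layout (one hole at pfn 700, node boundary at pfn 1536) is invented, and CHUNK_PFNS assumes 512 pages per 2MiB chunk, which is PTRS_PER_PMD on x86-64 with 4KiB pages. The return value is the number of pfns the caller may advance by: size when the whole chunk can be left uninitialized, 1 otherwise.

```c
#include <assert.h>
#include <stdbool.h>

#define CHUNK_PFNS   512UL  /* assumed: PTRS_PER_PMD on x86-64, 4KiB pages */
#define TOY_HOLE_PFN 700UL  /* hypothetical hole in the toy memory map */
#define TOY_NODE1_PFN 1536UL /* hypothetical: pfns from here on belong to node 1 */

/* Hypothetical stand-in for early_pfn_valid(). */
static bool toy_pfn_valid(unsigned long pfn)
{
	return pfn != TOY_HOLE_PFN;
}

/* Hypothetical stand-in for early_pfn_in_nid(). */
static bool toy_pfn_in_nid(unsigned long pfn, int nid)
{
	return (pfn < TOY_NODE1_PFN ? 0 : 1) == nid;
}

/*
 * Same checks, in the same order, as the quoted pfn_range_init_avail().
 * Returns size if [pfn, pfn + size) may be deferred as one chunk,
 * or 1 if the caller has to initialize this pfn on its own.
 */
static unsigned long toy_range_init_avail(unsigned long pfn,
					  unsigned long end_pfn,
					  unsigned long size, int nid)
{
	unsigned long validate_end_pfn = pfn + size;

	assert(size && !(size & (size - 1)));	/* size must be a power of two */

	if (pfn & (size - 1))		/* chunk start is not aligned */
		return 1;
	if (pfn + size >= end_pfn)	/* chunk reaches the end of the range */
		return 1;

	while (pfn < validate_end_pfn) {
		if (!toy_pfn_valid(pfn))
			return 1;
		if (!toy_pfn_in_nid(pfn, nid))
			return 1;
		pfn++;
	}
	return size;
}

/*
 * Walk [start_pfn, end_pfn) the way the patched memmap_init_zone() loop
 * does and count how many struct pages would be initialized up front.
 */
static unsigned long toy_count_eager_inits(unsigned long start_pfn,
					   unsigned long end_pfn, int nid)
{
	unsigned long pfn = start_pfn;
	unsigned long eager = 0;

	while (pfn < end_pfn) {
		if (!toy_pfn_valid(pfn) || !toy_pfn_in_nid(pfn, nid)) {
			pfn++;
			continue;
		}
		eager++;	/* this pfn gets __init_single_page() now */
		pfn += toy_range_init_avail(pfn, end_pfn, CHUNK_PFNS, nid);
	}
	return eager;
}
```

In this toy layout a chunk that ends exactly at end_pfn is never deferred, because the quoted check is written with >= rather than >; whether that is intended is not stated in the thread.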
> > > + > > page = pfn_to_page(pfn); > > __init_single_page(page, zone, nid, 1); > > + > > + if (pfns > 1) > > + SetPageUninitialized2Mib(page); > > + > > + pfn += pfns; > > } > > } > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > PG_uninitialized_2m ? Done. > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, > > Yinghai -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx161.postini.com [74.125.245.161]) by kanga.kvack.org (Postfix) with SMTP id 1A51B6B0032 for ; Tue, 16 Jul 2013 06:39:00 -0400 (EDT) Date: Tue, 16 Jul 2013 05:38:58 -0500 From: Robin Holt Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130716103857.GH3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715143037.8287ffbf2fb0e72bc8efb287@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130715143037.8287ffbf2fb0e72bc8efb287@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Greg KH , Yinghai Lu , Mel Gorman On Mon, Jul 15, 2013 at 02:30:37PM -0700, Andrew Morton wrote: > On Thu, 11 Jul 2013 21:03:55 -0500 Robin Holt wrote: > > > During boot of large memory machines, a significant portion of boot > > is spent initializing the struct page array. 
The vast majority of > > those pages are not referenced during boot. > > > > Change this over to only initializing the pages when they are > > actually allocated. > > > > Besides the advantage of boot speed, this allows us the chance to > > use normal performance monitoring tools to determine where the bulk > > of time is spent during page initialization. > > > > ... > > > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > > __free_page(page); > > } > > > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > + > > +static inline void __reserve_bootmem_page(struct page *page) > > +{ > > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > > + phys_addr_t end = start + PAGE_SIZE; > > + > > + __reserve_bootmem_region(start, end); > > +} > > It isn't obvious that this needed to be inlined? It is being declared in a header file. All the other functions I came across in that header file are declared as inline (or __always_inline). It feels to me like this is right. Can I leave it as-is? > > > static inline void free_reserved_page(struct page *page) > > { > > + __reserve_bootmem_page(page); > > __free_reserved_page(page); > > adjust_managed_page_count(page, 1); > > } > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 6d53675..79e8eb7 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -83,6 +83,7 @@ enum pageflags { > > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > > PG_arch_1, > > PG_reserved, > > + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ > > "mib" creeps me out too. And it makes me think of SNMP, which I'd > prefer not to think about. > > We've traditionally had fears of running out of page flags, but I've > lost track of how close we are to that happening. 
IIRC the answer > depends on whether you believe there is such a thing as a 32-bit NUMA > system. > > Can this be avoided anyway? I suspect there's some idiotic combination > of flags we could use to indicate the state. PG_reserved|PG_lru or > something. > > "2MB" sounds terribly arch-specific. Shouldn't we make it more generic > for when the hexagon64 port wants to use 4MB? > > That conversational code comment was already commented on, but it's > still there? I am going to work on making it non-2m based over the course of this week, so expect the _2m (current name based on Yinghai's comments) to go away entirely. > > > > ... > > > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > > #endif > > } > > > > +static void expand_page_initialization(struct page *basepage) > > +{ > > + unsigned long pfn = page_to_pfn(basepage); > > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > > + unsigned long zone = page_zonenum(basepage); > > + int reserved = PageReserved(basepage); > > + int nid = page_to_nid(basepage); > > + > > + ClearPageUninitialized2Mib(basepage); > > + > > + for( pfn++; pfn < end_pfn; pfn++ ) > > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > > +} > > + > > +void ensure_pages_are_initialized(unsigned long start_pfn, > > + unsigned long end_pfn) > > I think this can be made static. I hope so, as it's a somewhat > odd-sounding identifier for a global. Done. > > +{ > > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > > + unsigned long aligned_end_pfn; > > + struct page *page; > > + > > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > > + aligned_end_pfn += PTRS_PER_PMD; > > + while (aligned_start_pfn < aligned_end_pfn) { > > + if (pfn_valid(aligned_start_pfn)) { > > + page = pfn_to_page(aligned_start_pfn); > > + > > + if(PageUninitialized2Mib(page)) > > checkpatch them, please. Will certainly do. 
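The alignment arithmetic at the top of the quoted ensure_pages_are_initialized() can be checked in isolation. This is a hedged user-space sketch, not the kernel function: toy_align_span() and toy_chunks_visited() are hypothetical names, and CHUNK_PFNS assumes 512 pages per chunk (PTRS_PER_PMD on x86-64 with 4KiB pages). It computes the same rounded-down start and rounded-then-bumped end as the quoted code and reports how many chunk heads the while loop would visit.

```c
#include <assert.h>

#define CHUNK_PFNS 512UL	/* assumed: PTRS_PER_PMD on x86-64, 4KiB pages */

struct pfn_span {
	unsigned long start;	/* first chunk head the loop looks at */
	unsigned long end;	/* loop stops when it reaches this pfn */
};

/* Mirrors the aligned_start_pfn / aligned_end_pfn computation. */
static struct pfn_span toy_align_span(unsigned long start_pfn,
				      unsigned long end_pfn)
{
	struct pfn_span span;

	assert(start_pfn <= end_pfn);

	span.start = start_pfn & ~(CHUNK_PFNS - 1);
	span.end = (end_pfn & ~(CHUNK_PFNS - 1)) + CHUNK_PFNS;
	return span;
}

/* Number of chunk heads the quoted while loop would test. */
static unsigned long toy_chunks_visited(unsigned long start_pfn,
					unsigned long end_pfn)
{
	struct pfn_span span = toy_align_span(start_pfn, end_pfn);

	return (span.end - span.start) / CHUNK_PFNS;
}
```

If end_pfn is treated as exclusive and is already chunk-aligned, this arithmetic covers one chunk more than the range strictly needs; in the quoted code that extra chunk is only expanded when its head page still carries the uninitialized flag.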
> > + expand_page_initialization(page); > > + } > > + > > + aligned_start_pfn += PTRS_PER_PMD; > > + } > > +} > > Some nice code comments for the above two functions would be helpful. Will do. > > > > ... > > > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > > + unsigned long size, int nid) > > +{ > > + unsigned long validate_end_pfn = pfn + size; > > + > > + if (pfn & (size - 1)) > > + return 1; > > + > > + if (pfn + size >= end_pfn) > > + return 1; > > + > > + while (pfn < validate_end_pfn) > > + { > > + if (!early_pfn_valid(pfn)) > > + return 1; > > + if (!early_pfn_in_nid(pfn, nid)) > > + return 1; > > + pfn++; > > + } > > + > > + return size; > > +} > > Document it, please. The return value semantics look odd, so don't > forget to explain all that as well. Will do. Will also work on the name to make it more clear what we are returning. > > > > ... > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > It would be better if the name which is visible in procfs matches the > name in the kernel source code. Done and will try to maintain the consistency. > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, Robin -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx111.postini.com [74.125.245.111]) by kanga.kvack.org (Postfix) with SMTP id 2ED1D6B0032 for ; Tue, 16 Jul 2013 09:03:03 -0400 (EDT) Received: by mail-pb0-f42.google.com with SMTP id un1so679670pbc.15 for ; Tue, 16 Jul 2013 06:03:02 -0700 (PDT) Message-ID: <51E5447D.70901@gmail.com> Date: Tue, 16 Jul 2013 21:02:53 +0800 From: Sam Ben MIME-Version: 1.0 Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks. References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-3-git-send-email-holt@sgi.com> In-Reply-To: <1373594635-131067-3-git-send-email-holt@sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Robin Holt Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Hi Robin, On 07/12/2013 10:03 AM, Robin Holt wrote: > Currently, when free_all_bootmem() calls __free_pages_memory(), the > number of contiguous pages that __free_pages_memory() passes to the > buddy allocator is limited to BITS_PER_LONG. In order to be able to I fail to understand this. Why the original page number is BITS_PER_LONG? > free only the first page of a 2MiB chunk, we need that to be increased > to PTRS_PER_PMD. > > Signed-off-by: Robin Holt > Signed-off-by: Nate Zimmer > To: "H. 
Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > --- > mm/nobootmem.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index bdd3fa2..3b512ca 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) > static void __init __free_pages_memory(unsigned long start, unsigned long end) > { > unsigned long i, start_aligned, end_aligned; > - int order = ilog2(BITS_PER_LONG); > + int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD)); > > - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1); > - end_aligned = end & ~(BITS_PER_LONG - 1); > + start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1); > + end_aligned = end & ~((1UL << order) - 1); > > if (end_aligned <= start_aligned) { > for (i = start; i < end; i++) > @@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end) > for (i = start; i < start_aligned; i++) > __free_pages_bootmem(pfn_to_page(i), 0); > > - for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG) > + for (i = start_aligned; i < end_aligned; i += 1 << order) > __free_pages_bootmem(pfn_to_page(i), order); > > for (i = end_aligned; i < end; i++) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158]) by kanga.kvack.org (Postfix) with SMTP id E813B6B0032 for ; Wed, 17 Jul 2013 01:18:00 -0400 (EDT) Received: by mail-oa0-f46.google.com with SMTP id h1so1958090oag.33 for ; Tue, 16 Jul 2013 22:18:00 -0700 (PDT) Message-ID: <51E628F8.6030303@gmail.com> Date: Wed, 17 Jul 2013 13:17:44 +0800 From: Sam Ben MIME-Version: 1.0 Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1373594635-131067-1-git-send-email-holt@sgi.com> In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Robin Holt Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman On 07/12/2013 10:03 AM, Robin Holt wrote: > We have been working on this since we returned from shutdown and have > something to discuss now. We restricted ourselves to 2MiB initialization > to keep the patch set a little smaller and more clear. > > First, I think I want to propose getting rid of the page flag. If I knew > of a concrete way to determine that the page has not been initialized, > this patch series would look different. If there is no definitive > way to determine that the struct page has been initialized aside from > checking the entire page struct is zero, then I think I would suggest > we change the page flag to indicate the page has been initialized. > > The heart of the problem as I see it comes from expand(). We nearly > always see a first reference to a struct page which is in the middle > of the 2MiB region. Due to that access, the unlikely() check that was > originally proposed really ends up referencing a different page entirely. 
> We actually did not introduce an unlikely and refactor the patches to > make that unlikely inside a static inline function. Also, given the > strong warning at the head of expand(), we did not feel experienced > enough to refactor it to make things always reference the 2MiB page > first. > > With this patch, we did boot a 16TiB machine. Without the patches, > the v3.10 kernel with the same configuration took 407 seconds for > free_all_bootmem. With the patches and operating on 2MiB pages instead > of 1GiB, it took 26 seconds so performance was improved. I have no feel > for how the 1GiB chunk size will perform. How to test how much time spend on free_all_bootmem? > > I am on vacation for the next three days so I am sorry in advance for > my infrequent or non-existant responses. > > > Signed-off-by: Robin Holt > Signed-off-by: Nate Zimmer > To: "H. Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx109.postini.com [74.125.245.109]) by kanga.kvack.org (Postfix) with SMTP id DB74F6B0032 for ; Wed, 17 Jul 2013 05:30:54 -0400 (EDT) Date: Wed, 17 Jul 2013 04:30:52 -0500 From: Robin Holt Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130717093051.GK3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <51E628F8.6030303@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51E628F8.6030303@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Sam Ben Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote: > On 07/12/2013 10:03 AM, Robin Holt wrote: > >We have been working on this since we returned from shutdown and have > >something to discuss now. We restricted ourselves to 2MiB initialization > >to keep the patch set a little smaller and more clear. > > > >First, I think I want to propose getting rid of the page flag. If I knew > >of a concrete way to determine that the page has not been initialized, > >this patch series would look different. If there is no definitive > >way to determine that the struct page has been initialized aside from > >checking the entire page struct is zero, then I think I would suggest > >we change the page flag to indicate the page has been initialized. > > > >The heart of the problem as I see it comes from expand(). We nearly > >always see a first reference to a struct page which is in the middle > >of the 2MiB region. Due to that access, the unlikely() check that was > >originally proposed really ends up referencing a different page entirely. 
> >We actually did not introduce an unlikely and refactor the patches to > >make that unlikely inside a static inline function. Also, given the > >strong warning at the head of expand(), we did not feel experienced > >enough to refactor it to make things always reference the 2MiB page > >first. > > > >With this patch, we did boot a 16TiB machine. Without the patches, > >the v3.10 kernel with the same configuration took 407 seconds for > >free_all_bootmem. With the patches and operating on 2MiB pages instead > >of 1GiB, it took 26 seconds so performance was improved. I have no feel > >for how the 1GiB chunk size will perform. > > How to test how much time spend on free_all_bootmem? We had put a pr_emerg at the beginning and end of free_all_bootmem and then used a modified version of script which record the time in uSecs at the beginning of each line of output. Robin > > > > >I am on vacation for the next three days so I am sorry in advance for > >my infrequent or non-existant responses. > > > > > >Signed-off-by: Robin Holt > >Signed-off-by: Nate Zimmer > >To: "H. Peter Anvin" > >To: Ingo Molnar > >Cc: Linux Kernel > >Cc: Linux MM > >Cc: Rob Landley > >Cc: Mike Travis > >Cc: Daniel J Blueman > >Cc: Andrew Morton > >Cc: Greg KH > >Cc: Yinghai Lu > >Cc: Mel Gorman > >-- > >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > >the body of a message to majordomo@vger.kernel.org > >More majordomo info at http://vger.kernel.org/majordomo-info.html > >Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx117.postini.com [74.125.245.117]) by kanga.kvack.org (Postfix) with SMTP id E00FC6B0031 for ; Fri, 19 Jul 2013 19:51:50 -0400 (EDT) Received: by mail-ie0-f180.google.com with SMTP id f4so10415651iea.25 for ; Fri, 19 Jul 2013 16:51:50 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20130717093051.GK3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <51E628F8.6030303@gmail.com> <20130717093051.GK3421@sgi.com> Date: Fri, 19 Jul 2013 16:51:49 -0700 Message-ID: Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator From: Yinghai Lu Content-Type: multipart/mixed; boundary=90e6ba6e901ca9348804e1e602df Sender: owner-linux-mm@kvack.org List-ID: To: Robin Holt Cc: Sam Ben , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman --90e6ba6e901ca9348804e1e602df Content-Type: text/plain; charset=ISO-8859-1 On Wed, Jul 17, 2013 at 2:30 AM, Robin Holt wrote: > On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote: >> >With this patch, we did boot a 16TiB machine. Without the patches, >> >the v3.10 kernel with the same configuration took 407 seconds for >> >free_all_bootmem. With the patches and operating on 2MiB pages instead >> >of 1GiB, it took 26 seconds so performance was improved. I have no feel >> >for how the 1GiB chunk size will perform. >> >> How to test how much time spend on free_all_bootmem? > > We had put a pr_emerg at the beginning and end of free_all_bootmem and > then used a modified version of script which record the time in uSecs > at the beginning of each line of output. I used the two attached patches and found that a 3TiB system takes 100s before slub is ready. That time splits into about three portions: 1. sparse vmemmap buffer allocation: it goes through the bootmem wrapper, so clearing that struct page area takes about 30s. 2. 
memmap_init_zone: take about 25s 3. mem_init/free_all_bootmem about 30s. so still wonder why 16TiB will need hours. also your patches looks like only address 2 and 3. Yinghai --90e6ba6e901ca9348804e1e602df Content-Type: application/octet-stream; name="printk_time_tsc_0.patch" Content-Disposition: attachment; filename="printk_time_tsc_0.patch" Content-Transfer-Encoding: base64 X-Attachment-Id: f_hjc1alqv0 U3ViamVjdDogW1BBVENIXSBwcmludGtfdGltZTogcHJlcGFyZSBzdHViIGZvciB1c2luZyBvdGhl ciB0aGFuIGNwdV9jbG9jawoKLXYyOiByZWZyZXNoIHRvIHYzLjEwCgpTaWduZWQtb2ZmLWJ5OiBZ aW5naGFpIEx1IDx5aGx1Lmtlcm5lbEBnbWFpbC5jb20+CgotLS0KIGluY2x1ZGUvbGludXgvcHJp bnRrLmggfCAgICAzICsrKwogaW5pdC9tYWluLmMgICAgICAgICAgICB8ICAgIDEgKwoga2VybmVs L3ByaW50ay5jICAgICAgICB8ICAgMTcgKysrKysrKysrKysrKysrLS0KIDMgZmlsZXMgY2hhbmdl ZCwgMTkgaW5zZXJ0aW9ucygrKSwgMiBkZWxldGlvbnMoLSkKCkluZGV4OiBsaW51eC0yLjYvaW5p dC9tYWluLmMKPT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PQotLS0gbGludXgtMi42Lm9yaWcvaW5pdC9tYWluLmMKKysrIGxp bnV4LTIuNi9pbml0L21haW4uYwpAQCAtNjAxLDYgKzYwMSw3IEBAIGFzbWxpbmthZ2Ugdm9pZCBf X2luaXQgc3RhcnRfa2VybmVsKHZvaWQKIAkJbGF0ZV90aW1lX2luaXQoKTsKIAlzY2hlZF9jbG9j a19pbml0KCk7CiAJY2FsaWJyYXRlX2RlbGF5KCk7CisJc2V0X3ByaW50a190aW1lX2Nsb2NrKGxv Y2FsX2Nsb2NrKTsKIAlwaWRtYXBfaW5pdCgpOwogCWFub25fdm1hX2luaXQoKTsKICNpZmRlZiBD T05GSUdfWDg2CkluZGV4OiBsaW51eC0yLjYva2VybmVsL3ByaW50ay5jCj09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT0KLS0t IGxpbnV4LTIuNi5vcmlnL2tlcm5lbC9wcmludGsuYworKysgbGludXgtMi42L2tlcm5lbC9wcmlu dGsuYwpAQCAtMjE3LDYgKzIxNywxOSBAQCBzdHJ1Y3QgbG9nIHsKIHN0YXRpYyBERUZJTkVfUkFX X1NQSU5MT0NLKGxvZ2J1Zl9sb2NrKTsKIAogI2lmZGVmIENPTkZJR19QUklOVEsKKworc3RhdGlj IHU2NCBkZWZhdWx0X3ByaW50a190aW1lX2Nsb2NrKHZvaWQpCit7CisJcmV0dXJuIDA7Cit9CisK K3N0YXRpYyB1NjQgKCpwcmludGtfdGltZV9jbG9jaykodm9pZCkgPSBkZWZhdWx0X3ByaW50a190 aW1lX2Nsb2NrOworCit2b2lkIHNldF9wcmludGtfdGltZV9jbG9jayh1NjQgKCpmbikodm9pZCkp 
Cit7CisJcHJpbnRrX3RpbWVfY2xvY2sgPSBmbjsKK30KKwogREVDTEFSRV9XQUlUX1FVRVVFX0hF QUQobG9nX3dhaXQpOwogLyogdGhlIG5leHQgcHJpbnRrIHJlY29yZCB0byByZWFkIGJ5IHN5c2xv ZyhSRUFEKSBvciAvcHJvYy9rbXNnICovCiBzdGF0aWMgdTY0IHN5c2xvZ19zZXE7CkBAIC0zNTQs NyArMzY3LDcgQEAgc3RhdGljIHZvaWQgbG9nX3N0b3JlKGludCBmYWNpbGl0eSwgaW50CiAJaWYg KHRzX25zZWMgPiAwKQogCQltc2ctPnRzX25zZWMgPSB0c19uc2VjOwogCWVsc2UKLQkJbXNnLT50 c19uc2VjID0gbG9jYWxfY2xvY2soKTsKKwkJbXNnLT50c19uc2VjID0gcHJpbnRrX3RpbWVfY2xv Y2soKTsKIAltZW1zZXQobG9nX2RpY3QobXNnKSArIGRpY3RfbGVuLCAwLCBwYWRfbGVuKTsKIAlt c2ctPmxlbiA9IHNpemVvZihzdHJ1Y3QgbG9nKSArIHRleHRfbGVuICsgZGljdF9sZW4gKyBwYWRf bGVuOwogCkBAIC0xNDU3LDcgKzE0NzAsNyBAQCBzdGF0aWMgYm9vbCBjb250X2FkZChpbnQgZmFj aWxpdHksIGludCBsCiAJCWNvbnQuZmFjaWxpdHkgPSBmYWNpbGl0eTsKIAkJY29udC5sZXZlbCA9 IGxldmVsOwogCQljb250Lm93bmVyID0gY3VycmVudDsKLQkJY29udC50c19uc2VjID0gbG9jYWxf Y2xvY2soKTsKKwkJY29udC50c19uc2VjID0gcHJpbnRrX3RpbWVfY2xvY2soKTsKIAkJY29udC5m bGFncyA9IDA7CiAJCWNvbnQuY29ucyA9IDA7CiAJCWNvbnQuZmx1c2hlZCA9IGZhbHNlOwpJbmRl eDogbGludXgtMi42L2luY2x1ZGUvbGludXgvcHJpbnRrLmgKPT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQotLS0gbGludXgt Mi42Lm9yaWcvaW5jbHVkZS9saW51eC9wcmludGsuaAorKysgbGludXgtMi42L2luY2x1ZGUvbGlu dXgvcHJpbnRrLmgKQEAgLTEwNyw2ICsxMDcsNyBAQCB2b2lkIGVhcmx5X3ByaW50ayhjb25zdCBj aGFyICpzLCAuLi4pIHsKICNlbmRpZgogCiAjaWZkZWYgQ09ORklHX1BSSU5USworZXh0ZXJuIHZv aWQgc2V0X3ByaW50a190aW1lX2Nsb2NrKHU2NCAoKmZuKSh2b2lkKSk7CiBhc21saW5rYWdlIF9f cHJpbnRmKDUsIDApCiBpbnQgdnByaW50a19lbWl0KGludCBmYWNpbGl0eSwgaW50IGxldmVsLAog CQkgY29uc3QgY2hhciAqZGljdCwgc2l6ZV90IGRpY3RsZW4sCkBAIC0xNTAsNiArMTUxLDggQEAg dm9pZCBkdW1wX3N0YWNrX3NldF9hcmNoX2Rlc2MoY29uc3QgY2hhcgogdm9pZCBkdW1wX3N0YWNr X3ByaW50X2luZm8oY29uc3QgY2hhciAqbG9nX2x2bCk7CiB2b2lkIHNob3dfcmVnc19wcmludF9p bmZvKGNvbnN0IGNoYXIgKmxvZ19sdmwpOwogI2Vsc2UKK3N0YXRpYyBpbmxpbmUgdm9pZCBzZXRf cHJpbnRrX3RpbWVfY2xvY2sodTY0ICgqZm4pKHZvaWQpKSB7IH0KKwogc3RhdGljIGlubGluZSBf 
X3ByaW50ZigxLCAwKQogaW50IHZwcmludGsoY29uc3QgY2hhciAqcywgdmFfbGlzdCBhcmdzKQog ewo= --90e6ba6e901ca9348804e1e602df Content-Type: application/octet-stream; name="printk_time_tsc_1.patch" Content-Disposition: attachment; filename="printk_time_tsc_1.patch" Content-Transfer-Encoding: base64 X-Attachment-Id: f_hjc1au5d1 U3ViamVjdDogW1BBVENIXSB4ODY6IHByaW50a190aW1lIHRvIHVzZSB0c2MgYmVmb3JlIGNwdV9j bG9jayBpcyByZWFkeQoKc28gY2FuIGdldCB0c2MgdmFsdWUgb24gcHJpbnRrCgpuZWVkIHRvIGFw cGx5IGFmdGVyCglbUEFUQ0hdIHByaW50a190aW1lOiBwcmVwYXJlIHN0dWIgZm9yIHVzaW5nIG90 aGVyIHRoYW4gY3B1X2Nsb2NrCgotdjI6IHJlZnJlc2ggb24gdjMuMTAKClNpZ25lZC1vZmYtYnk6 IFlpbmdoYWkgTHUgPHlpbmdoYWlAa2VybmVsLm9yZz4KCi0tLQogYXJjaC94ODYva2VybmVsL2Nw dS9jb21tb24uYyB8ICAgMjEgKysrKysrKysrKysrKysrKysrKysrCiAxIGZpbGUgY2hhbmdlZCwg MjEgaW5zZXJ0aW9ucygrKQoKSW5kZXg6IGxpbnV4LTIuNi9hcmNoL3g4Ni9rZXJuZWwvY3B1L2Nv bW1vbi5jCj09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT0KLS0tIGxpbnV4LTIuNi5vcmlnL2FyY2gveDg2L2tlcm5lbC9jcHUv Y29tbW9uLmMKKysrIGxpbnV4LTIuNi9hcmNoL3g4Ni9rZXJuZWwvY3B1L2NvbW1vbi5jCkBAIC03 MzcsNiArNzM3LDI1IEBAIHN0YXRpYyB2b2lkIF9faW5pdCBlYXJseV9pZGVudGlmeV9jcHUoc3QK IAlzZXR1cF9mb3JjZV9jcHVfY2FwKFg4Nl9GRUFUVVJFX0FMV0FZUyk7CiB9CiAKK3N0YXRpYyB1 NjQgX19pbml0ZGF0YSB0c2NfY2xvY2tfYmFzZTsKKworc3RhdGljIF9faW5pdCB1NjQgdHNjX2Ns b2NrX29mZnNldCh2b2lkKQoreworCXVuc2lnbmVkIGxvbmcgbG9uZyB0OworCisJcmR0c2NsbCh0 KTsKKworCXJldHVybiB0IC0gdHNjX2Nsb2NrX2Jhc2U7Cit9CisKK3N0YXRpYyBfX2luaXQgdm9p ZCBlYXJseV9wcmludGtfdGltZV9pbml0KHZvaWQpCit7CisJaWYgKGNwdV9oYXNfdHNjKSB7CisJ CXJkdHNjbGwodHNjX2Nsb2NrX2Jhc2UpOworCQlzZXRfcHJpbnRrX3RpbWVfY2xvY2sodHNjX2Ns b2NrX29mZnNldCk7CisJfQorfQorCiB2b2lkIF9faW5pdCBlYXJseV9jcHVfaW5pdCh2b2lkKQog ewogCWNvbnN0IHN0cnVjdCBjcHVfZGV2ICpjb25zdCAqY2RldjsKQEAgLTc2MSw2ICs3ODAsOCBA QCB2b2lkIF9faW5pdCBlYXJseV9jcHVfaW5pdCh2b2lkKQogCX0KIAogCWVhcmx5X2lkZW50aWZ5 X2NwdSgmYm9vdF9jcHVfZGF0YSk7CisKKwllYXJseV9wcmludGtfdGltZV9pbml0KCk7CiB9CiAK IC8qCg== --90e6ba6e901ca9348804e1e602df-- -- To unsubscribe, send a 
message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx106.postini.com [74.125.245.106]) by kanga.kvack.org (Postfix) with SMTP id 3136F6B0032 for ; Mon, 22 Jul 2013 02:13:44 -0400 (EDT) Date: Mon, 22 Jul 2013 01:13:39 -0500 From: Robin Holt Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130722061339.GC3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <51E628F8.6030303@gmail.com> <20130717093051.GK3421@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Yinghai Lu Cc: Robin Holt , Sam Ben , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Fri, Jul 19, 2013 at 04:51:49PM -0700, Yinghai Lu wrote: > On Wed, Jul 17, 2013 at 2:30 AM, Robin Holt wrote: > > On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote: > >> >With this patch, we did boot a 16TiB machine. Without the patches, > >> >the v3.10 kernel with the same configuration took 407 seconds for > >> >free_all_bootmem. With the patches and operating on 2MiB pages instead > >> >of 1GiB, it took 26 seconds so performance was improved. I have no feel > >> >for how the 1GiB chunk size will perform. > >> > >> How to test how much time spend on free_all_bootmem? > > > > We had put a pr_emerg at the beginning and end of free_all_bootmem and > > then used a modified version of script which record the time in uSecs > > at the beginning of each line of output. > > used two patches, found 3TiB system will take 100s before slub is ready. > > about three portions: > 1. 
sparse vmemap buf allocation, it is with bootmem wrapper, so clear those > struct page area take about 30s. > 2. memmap_init_zone: take about 25s > 3. mem_init/free_all_bootmem about 30s. > > so still wonder why 16TiB will need hours. I don't know where you got the figure of hours for memory initialization. That is likely for a 32TiB boot and includes the entire boot, not just getting the memory allocator initialized. For a 16TiB boot, the same three portions took (in seconds): 1) 344 2) 1151 3) 407 I hope that illustrates why we chose to address memmap_init_zone first, which had the nice side effect of also reducing the free_all_bootmem slowdown. With these patches, those numbers are currently: 1) 344 2) 49 3) 26 > also your patches looks like only address 2 and 3. Right, but I thought that was the normal way to do things. Address one thing at a time and work toward a better kernel. I don't see a relationship between the work we are doing here and the sparse vmemmap buffer allocation. Have I missed something? Did you happen to time a boot with these patches applied to see how long it took and how much impact they had on a smaller config? Robin -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
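The 16TiB figures above can be put side by side with a little arithmetic. What follows is only a minimal sketch in C, assuming the three numbers are phase times in seconds, in the order listed earlier in the thread (sparse vmemmap buffer allocation, memmap_init_zone, mem_init/free_all_bootmem). The array and helper names are made up for illustration and do not come from the patch set.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Phase times in seconds for the 16TiB boot, as quoted in the thread.
 * Assumed order: [0] sparse vmemmap buffer allocation,
 *                [1] memmap_init_zone,
 *                [2] mem_init/free_all_bootmem.
 */
const unsigned int before_s[3] = { 344, 1151, 407 };
const unsigned int after_s[3]  = { 344, 49, 26 };

/* Sum of the three phase times (hypothetical helper). */
unsigned int total_seconds(const unsigned int t[3])
{
	return t[0] + t[1] + t[2];
}

/* Ratio before/after for one phase; 'after' must be non-zero. */
double speedup(unsigned int before, unsigned int after)
{
	assert(after != 0);
	return (double)before / (double)after;
}

/* Print one line per phase plus the totals. */
void print_summary(void)
{
	static const char *const name[3] = {
		"sparse vmemmap buf", "memmap_init_zone", "free_all_bootmem"
	};
	int i;

	for (i = 0; i < 3; i++)
		printf("%-20s %5u s -> %4u s (%.1fx)\n", name[i],
		       before_s[i], after_s[i],
		       speedup(before_s[i], after_s[i]));
	printf("%-20s %5u s -> %4u s\n", "total",
	       total_seconds(before_s), total_seconds(after_s));
}
```

Calling print_summary() from a main() gives totals of 1902 s before and 419 s after. The first phase is unchanged at 344 s, which matches the remark that the patches address only portions 2 and 3.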
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx108.postini.com [74.125.245.108]) by kanga.kvack.org (Postfix) with SMTP id 67AF76B0032 for ; Tue, 23 Jul 2013 04:19:01 -0400 (EDT) Received: by mail-ee0-f52.google.com with SMTP id c50so4319253eek.39 for ; Tue, 23 Jul 2013 01:18:59 -0700 (PDT) Date: Tue, 23 Jul 2013 10:18:56 +0200 From: Ingo Molnar Subject: Re: boot tracing Message-ID: <20130723081856.GC16088@gmail.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> <20130712084712.GD24008@pd.tnic> <20130712085341.GC4328@gmail.com> <51E3529F.6070909@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51E3529F.6070909@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Sam Ben Cc: Borislav Petkov , Robin Holt , Robert Richter , "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra * Sam Ben wrote: > On 07/12/2013 04:53 PM, Ingo Molnar wrote: > >* Borislav Petkov wrote: > > > >>On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > >>>Robert Richter and Boris Petkov are working on 'persistent events' > >>>support for perf, which will eventually allow boot time profiling - > >>>I'm not sure if the patches and the tooling support is ready enough > >>>yet for your purposes. > >>Nope, not yet but we're getting there. > >> > >>>Robert, Boris, the following workflow would be pretty intuitive: > >>> > >>> - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > >>What does perf=boot mean? I assume boot tracing. > >In this case it would mean boot profiling - i.e. a cycles hardware-PMU > >event collecting into a perf trace buffer as usual. 
> > > >Essentially a 'perf record -a' work-alike, just one that gets activated as > >early as practical, and which would allow the profiling of memory > >initialization. > > > >Now, one extra complication here is that to be able to profile buddy > >allocator this persistent event would have to work before the buddy > >allocator is active :-/ So this sort of profiling would have to use > >memblock_alloc(). > > Could perf=boot be used to sample the performance of memblock subsystem? > I think the perf subsystem is too late to be initialized and monitor > this. Yes, that would be a useful facility as well, for things with many events where printk is not necessarily practical. Any tracepoint could be utilized. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx149.postini.com [74.125.245.149]) by kanga.kvack.org (Postfix) with SMTP id 6B2296B0032 for ; Tue, 23 Jul 2013 04:20:44 -0400 (EDT) Received: by mail-ee0-f41.google.com with SMTP id d51so53433eek.0 for ; Tue, 23 Jul 2013 01:20:42 -0700 (PDT) Date: Tue, 23 Jul 2013 10:20:39 +0200 From: Ingo Molnar Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130723082039.GD16088@gmail.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> <20130716085502.GA31276@lge.com> <20130716090805.GC4402@pd.tnic> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130716090805.GC4402@pd.tnic> Sender: owner-linux-mm@kvack.org List-ID: To: Borislav Petkov Cc: Joonsoo Kim , Robin Holt , Robert Richter , "H.
Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra * Borislav Petkov wrote: > On Tue, Jul 16, 2013 at 05:55:02PM +0900, Joonsoo Kim wrote: > > How about executing a perf in usermodehelper and collecting output > > in tmpfs? Using this approach, we can start a perf after rootfs > > initialization, > > What for if we can start logging to buffers much earlier? *Reading* > from those buffers can be done much later, at our own leisure with full > userspace up. Yeah, agreed, I think this needs to be more integrated into the kernel, so that people don't have to worry about "when does userspace start up the earliest" details. Fundamentally all perf really needs here is some memory to initialize and buffer into. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx143.postini.com [74.125.245.143]) by kanga.kvack.org (Postfix) with SMTP id 25C4B6B0032 for ; Tue, 23 Jul 2013 04:32:16 -0400 (EDT) Received: by mail-ea0-f169.google.com with SMTP id h15so4408111eak.28 for ; Tue, 23 Jul 2013 01:32:14 -0700 (PDT) Date: Tue, 23 Jul 2013 10:32:11 +0200 From: Ingo Molnar Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130723083211.GE16088@gmail.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> <51E43F91.1040906@zytor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51E43F91.1040906@zytor.com> Sender: owner-linux-mm@kvack.org List-ID: To: "H. 
Peter Anvin" Cc: Robin Holt , Nathan Zimmer , Yinghai Lu , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman * H. Peter Anvin wrote: > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > Is there a fairly cheap way to determine definitively that the struct > > page is not initialized? > > By definition I would assume no. The only way I can think of would be > to unmap the memory associated with the struct page in the TLB and > initialize the struct pages at trap time. But ... the only fastpath impact I can see of delayed initialization right now is this piece of logic in prep_new_page(): @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) for (i = 0; i < (1 << order); i++) { struct page *p = page + i; + + if (PageUninitialized2Mib(p)) + expand_page_initialization(page); + if (unlikely(check_new_page(p))) return 1; That is where I think it can be made zero overhead in the already-initialized case, because page-flags are already used in check_new_page(): static inline int check_new_page(struct page *page) { if (unlikely(page_mapcount(page) | (page->mapping != NULL) | (atomic_read(&page->_count) != 0) | (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | (mem_cgroup_bad_page_check(page)))) { bad_page(page); return 1; see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every struct page on allocation. We can micro-optimize that low overhead to zero-overhead, by integrating the PageUninitialized2Mib() check into check_new_page(). This can be done by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { if (PageUninitialized2Mib(p)) expand_page_initialization(page); ... 
} if (unlikely(page_mapcount(page) | (page->mapping != NULL) | (atomic_read(&page->_count) != 0) | (mem_cgroup_bad_page_check(page)))) { bad_page(page); return 1; this will result in making it essentially zero-overhead, the expand_page_initialization() logic is now in a slowpath. Am I missing anything here? Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx153.postini.com [74.125.245.153]) by kanga.kvack.org (Postfix) with SMTP id 3C1D76B0033 for ; Tue, 23 Jul 2013 07:09:50 -0400 (EDT) Date: Tue, 23 Jul 2013 06:09:47 -0500 From: Robin Holt Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130723110947.GF3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> <51E43F91.1040906@zytor.com> <20130723083211.GE16088@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130723083211.GE16088@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: "H. Peter Anvin" , Robin Holt , Nathan Zimmer , Yinghai Lu , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote: > > * H. Peter Anvin wrote: > > > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > > > Is there a fairly cheap way to determine definitively that the struct > > > page is not initialized? > > > > By definition I would assume no. The only way I can think of would be > > to unmap the memory associated with the struct page in the TLB and > > initialize the struct pages at trap time. 
> > But ... the only fastpath impact I can see of delayed initialization right > now is this piece of logic in prep_new_page(): > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > for (i = 0; i < (1 << order); i++) { > struct page *p = page + i; > + > + if (PageUninitialized2Mib(p)) > + expand_page_initialization(page); > + > if (unlikely(check_new_page(p))) > return 1; > > That is where I think it can be made zero overhead in the > already-initialized case, because page-flags are already used in > check_new_page(): The problem I see here is that the flag we need to check is not in the page currently being referenced; it is in the "other" page, the one aligned at the 2MiB virtual address. Let me try a version of the patch where we set the PG_uninitialized_2m flag on all pages, including the aligned pages, and see what that does to performance. Robin > > static inline int check_new_page(struct page *page) > { > if (unlikely(page_mapcount(page) | > (page->mapping != NULL) | > (atomic_read(&page->_count) != 0) | > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > (mem_cgroup_bad_page_check(page)))) { > bad_page(page); > return 1; > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every > struct page on allocation. > > We can micro-optimize that low overhead to zero-overhead, by integrating > the PageUninitialized2Mib() check into check_new_page(). This can be done > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { > if (PageUninitialized2Mib(p)) > expand_page_initialization(page); > ... > } > > if (unlikely(page_mapcount(page) | > (page->mapping != NULL) | > (atomic_read(&page->_count) != 0) | > (mem_cgroup_bad_page_check(page)))) { > bad_page(page); > > return 1; > > this will result in making it essentially zero-overhead, the > expand_page_initialization() logic is now in a slowpath.
> > Am I missing anything here? > > Thanks, > > Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx137.postini.com [74.125.245.137]) by kanga.kvack.org (Postfix) with SMTP id 1D85E6B0032 for ; Tue, 23 Jul 2013 07:15:51 -0400 (EDT) Date: Tue, 23 Jul 2013 06:15:49 -0500 From: Robin Holt Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130723111549.GG3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> <51E43F91.1040906@zytor.com> <20130723083211.GE16088@gmail.com> <20130723110947.GF3421@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130723110947.GF3421@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: "H. Peter Anvin" , Robin Holt , Nathan Zimmer , Yinghai Lu , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman I think the other critical path which is affected is in expand(). There, we just call ensure_page_is_initialized() blindly which does the check against the other page. The below is a nearly zero addition. Sorry for the confusion. My morning coffee has not kicked in yet. Robin On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote: > On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote: > > > > * H. Peter Anvin wrote: > > > > > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > > > > > Is there a fairly cheap way to determine definitively that the struct > > > > page is not initialized? > > > > > > By definition I would assume no. 
The only way I can think of would be > > > to unmap the memory associated with the struct page in the TLB and > > > initialize the struct pages at trap time. > > > > But ... the only fastpath impact I can see of delayed initialization right > > now is this piece of logic in prep_new_page(): > > > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > for (i = 0; i < (1 << order); i++) { > > struct page *p = page + i; > > + > > + if (PageUninitialized2Mib(p)) > > + expand_page_initialization(page); > > + > > if (unlikely(check_new_page(p))) > > return 1; > > > > That is where I think it can be made zero overhead in the > > already-initialized case, because page-flags are already used in > > check_new_page(): > > The problem I see here is that the page flags we need to check for the > uninitialized flag are in the "other" page for the page aligned at the > 2MiB virtual address, not the page currently being referenced. > > Let me try a version of the patch where we set the PG_unintialized_2m > flag on all pages, including the aligned pages and see what that does > to performance. > > Robin > > > > > static inline int check_new_page(struct page *page) > > { > > if (unlikely(page_mapcount(page) | > > (page->mapping != NULL) | > > (atomic_read(&page->_count) != 0) | > > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > > (mem_cgroup_bad_page_check(page)))) { > > bad_page(page); > > return 1; > > > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every > > struct page on allocation. > > > > We can micro-optimize that low overhead to zero-overhead, by integrating > > the PageUninitialized2Mib() check into check_new_page(). This can be done > > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: > > > > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { > > if (PageUninitialized2Mib(p)) > > expand_page_initialization(page); > > ... 
> > } > > > > if (unlikely(page_mapcount(page) | > > (page->mapping != NULL) | > > (atomic_read(&page->_count) != 0) | > > (mem_cgroup_bad_page_check(page)))) { > > bad_page(page); > > > > return 1; > > > > this will result in making it essentially zero-overhead, the > > expand_page_initialization() logic is now in a slowpath. > > > > Am I missing anything here? > > > > Thanks, > > > > Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx168.postini.com [74.125.245.168]) by kanga.kvack.org (Postfix) with SMTP id C798D6B0032 for ; Tue, 23 Jul 2013 07:41:52 -0400 (EDT) Date: Tue, 23 Jul 2013 06:41:50 -0500 From: Robin Holt Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130723114150.GH3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> <51E43F91.1040906@zytor.com> <20130723083211.GE16088@gmail.com> <20130723110947.GF3421@sgi.com> <20130723111549.GG3421@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130723111549.GG3421@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: "H. Peter Anvin" , Robin Holt , Nathan Zimmer , Yinghai Lu , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Tue, Jul 23, 2013 at 06:15:49AM -0500, Robin Holt wrote: > I think the other critical path which is affected is in expand(). > There, we just call ensure_page_is_initialized() blindly which does > the check against the other page. The below is a nearly zero addition. > Sorry for the confusion. 
My morning coffee has not kicked in yet. I don't have access to the 16TiB system until Thursday unless the other testing on it fails early. I did boot a 2TiB system with a change which sets the Uninitialized_2m flag on all pages in that 2MiB range during memmap_init_zone. That makes the expand() check test the referenced page itself instead of having to go back to the 2MiB-aligned page. It appears to have added less than a second to the 2TiB boot, so I hope it has equally little impact on the 16TiB boot. I will clean up this patch some more and resend the currently untested set later today. Thanks, Robin > > Robin > > On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote: > > On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote: > > > > > > * H. Peter Anvin wrote: > > > > > > > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > > > > > > > Is there a fairly cheap way to determine definitively that the struct > > > > > page is not initialized? > > > > > > > > By definition I would assume no. The only way I can think of would be > > > > to unmap the memory associated with the struct page in the TLB and > > > > initialize the struct pages at trap time. > > > > > > But ...
the only fastpath impact I can see of delayed initialization right > > > now is this piece of logic in prep_new_page(): > > > > > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > > > for (i = 0; i < (1 << order); i++) { > > > struct page *p = page + i; > > > + > > > + if (PageUninitialized2Mib(p)) > > > + expand_page_initialization(page); > > > + > > > if (unlikely(check_new_page(p))) > > > return 1; > > > > > > That is where I think it can be made zero overhead in the > > > already-initialized case, because page-flags are already used in > > > check_new_page(): > > > > The problem I see here is that the page flags we need to check for the > > uninitialized flag are in the "other" page for the page aligned at the > > 2MiB virtual address, not the page currently being referenced. > > > > Let me try a version of the patch where we set the PG_unintialized_2m > > flag on all pages, including the aligned pages and see what that does > > to performance. > > > > Robin > > > > > > > > static inline int check_new_page(struct page *page) > > > { > > > if (unlikely(page_mapcount(page) | > > > (page->mapping != NULL) | > > > (atomic_read(&page->_count) != 0) | > > > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > > > (mem_cgroup_bad_page_check(page)))) { > > > bad_page(page); > > > return 1; > > > > > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every > > > struct page on allocation. > > > > > > We can micro-optimize that low overhead to zero-overhead, by integrating > > > the PageUninitialized2Mib() check into check_new_page(). This can be done > > > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: > > > > > > > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { > > > if (PageUninitialized2Mib(p)) > > > expand_page_initialization(page); > > > ... 
> > > } > > > > > > if (unlikely(page_mapcount(page) | > > > (page->mapping != NULL) | > > > (atomic_read(&page->_count) != 0) | > > > (mem_cgroup_bad_page_check(page)))) { > > > bad_page(page); > > > > > > return 1; > > > > > > this will result in making it essentially zero-overhead, the > > > expand_page_initialization() logic is now in a slowpath. > > > > > > Am I missing anything here? > > > > > > Thanks, > > > > > > Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx180.postini.com [74.125.245.180]) by kanga.kvack.org (Postfix) with SMTP id 3B6576B0032 for ; Tue, 23 Jul 2013 07:50:23 -0400 (EDT) Date: Tue, 23 Jul 2013 06:50:21 -0500 From: Robin Holt Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130723115021.GI3421@sgi.com> References: <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> <51E43F91.1040906@zytor.com> <20130723083211.GE16088@gmail.com> <20130723110947.GF3421@sgi.com> <20130723111549.GG3421@sgi.com> <20130723114150.GH3421@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130723114150.GH3421@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: "H. Peter Anvin" , Robin Holt , Nathan Zimmer , Yinghai Lu , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Tue, Jul 23, 2013 at 06:41:50AM -0500, Robin Holt wrote: > On Tue, Jul 23, 2013 at 06:15:49AM -0500, Robin Holt wrote: > > I think the other critical path which is affected is in expand(). 
> > There, we just call ensure_page_is_initialized() blindly which does > > the check against the other page. The below is a nearly zero addition. > > Sorry for the confusion. My morning coffee has not kicked in yet. > > I don't have access to the 16TiB system until Thursday unless the other > testing on it fails early. I did boot a 2TiB system with the a change > which set the Unitialized_2m flag on all pages in that 2MiB range > during memmap_init_zone. That makes the expand check test against > the referenced page instead of having to go back to the 2MiB page. > It appears to have added less than a second to the 2TiB boot so I hope > it has equally little impact to the 16TiB boot. I was wrong. One of the two logs I looked at was the wrong one. Setting that Uninitialized2m flag on all pages added 30 seconds to the 2TiB boot's memmap_init_zone(). Please disregard my previous numbers. That brings me back to the belief that we need a better solution for the expand() path. Robin > > I will clean up this patch some more and resend the currently untested > set later today. > > Thanks, > Robin > > > > > Robin > > > > On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote: > > > On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote: > > > > > > > > * H. Peter Anvin wrote: > > > > > > > > > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > > > > > > > > > Is there a fairly cheap way to determine definitively that the struct > > > > > > page is not initialized? > > > > > > > > > > By definition I would assume no. The only way I can think of would be > > > > > to unmap the memory associated with the struct page in the TLB and > > > > > initialize the struct pages at trap time. > > > > > > > > But ...
the only fastpath impact I can see of delayed initialization right > > > > now is this piece of logic in prep_new_page(): > > > > > > > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > > > > > for (i = 0; i < (1 << order); i++) { > > > > struct page *p = page + i; > > > > + > > > > + if (PageUninitialized2Mib(p)) > > > > + expand_page_initialization(page); > > > > + > > > > if (unlikely(check_new_page(p))) > > > > return 1; > > > > > > > > That is where I think it can be made zero overhead in the > > > > already-initialized case, because page-flags are already used in > > > > check_new_page(): > > > > > > The problem I see here is that the page flags we need to check for the > > > uninitialized flag are in the "other" page for the page aligned at the > > > 2MiB virtual address, not the page currently being referenced. > > > > > > Let me try a version of the patch where we set the PG_unintialized_2m > > > flag on all pages, including the aligned pages and see what that does > > > to performance. > > > > > > Robin > > > > > > > > > > > static inline int check_new_page(struct page *page) > > > > { > > > > if (unlikely(page_mapcount(page) | > > > > (page->mapping != NULL) | > > > > (atomic_read(&page->_count) != 0) | > > > > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > > > > (mem_cgroup_bad_page_check(page)))) { > > > > bad_page(page); > > > > return 1; > > > > > > > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every > > > > struct page on allocation. > > > > > > > > We can micro-optimize that low overhead to zero-overhead, by integrating > > > > the PageUninitialized2Mib() check into check_new_page(). This can be done > > > > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: > > > > > > > > > > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { > > > > if (PageUninitialized2Mib(p)) > > > > expand_page_initialization(page); > > > > ... 
> > > > } > > > > > > > > if (unlikely(page_mapcount(page) | > > > > (page->mapping != NULL) | > > > > (atomic_read(&page->_count) != 0) | > > > > (mem_cgroup_bad_page_check(page)))) { > > > > bad_page(page); > > > > > > > > return 1; > > > > > > > > this will result in making it essentially zero-overhead, the > > > > expand_page_initialization() logic is now in a slowpath. > > > > > > > > Am I missing anything here? > > > > > > > > Thanks, > > > > > > > > Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx178.postini.com [74.125.245.178]) by kanga.kvack.org (Postfix) with SMTP id 236126B0032 for ; Tue, 23 Jul 2013 11:33:15 -0400 (EDT) Date: Tue, 23 Jul 2013 11:32:57 -0400 From: Johannes Weiner Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks. Message-ID: <20130723153257.GK715@cmpxchg.org> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-3-git-send-email-holt@sgi.com> <51E5447D.70901@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51E5447D.70901@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Sam Ben Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman On Tue, Jul 16, 2013 at 09:02:53PM +0800, Sam Ben wrote: > Hi Robin, > On 07/12/2013 10:03 AM, Robin Holt wrote: > >Currently, when free_all_bootmem() calls __free_pages_memory(), the > >number of contiguous pages that __free_pages_memory() passes to the > >buddy allocator is limited to BITS_PER_LONG. In order to be able to > > I fail to understand this. Why the original page number is BITS_PER_LONG? 
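For readers following along, the BITS_PER_LONG walk can be sketched in user space. This is only an illustration under stated assumptions, not the mm/bootmem.c code: walk_free_bitmap() and its parameters are hypothetical names, and a set bit is assumed here to mean "page is free" so that a fully free word compares equal to ~0UL.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical user-space model, not the kernel implementation.
 * Assumption: a set bit in free_map[] means "this page is free".
 * Returns the number of pages that would be released and stores in
 * *bulk_chunks how many whole words took the single-compare fast path.
 */
static size_t walk_free_bitmap(const unsigned long *free_map, size_t nr_words,
                               size_t *bulk_chunks)
{
    size_t freed = 0;

    assert(free_map != NULL && bulk_chunks != NULL);
    *bulk_chunks = 0;

    for (size_t w = 0; w < nr_words; w++) {
        unsigned long chunk = free_map[w];

        if (chunk == ~0UL) {
            /* Whole word is free: one compare covers BITS_PER_LONG pages. */
            freed += BITS_PER_LONG;
            (*bulk_chunks)++;
            continue;
        }

        /* Mixed word: fall back to testing one bit per page. */
        for (size_t bit = 0; bit < BITS_PER_LONG; bit++)
            if (chunk & (1UL << bit))
                freed++;
    }

    return freed;
}
```

With an all-ones word the loop spends one comparison on BITS_PER_LONG pages; any other word falls back to per-bit testing, so a step larger than one word would no longer be a single trivial check.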
The mm/bootmem.c implementation uses a bitmap to keep track of free/reserved pages. It walks that bitmap in BITS_PER_LONG steps because it is the biggest chunk that is still trivial and cheap to check if all pages are free in it (chunk == ~0UL). nobootmem.c was written based on the bootmem.c interface, so it was probably adapted to keep things similar between the two, short of a pressing reason not to. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx132.postini.com [74.125.245.132]) by kanga.kvack.org (Postfix) with SMTP id DF5676B0031 for ; Wed, 24 Jul 2013 22:25:45 -0400 (EDT) Date: Wed, 24 Jul 2013 21:25:43 -0500 From: Robin Holt Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130725022543.GR3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Yinghai Lu Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote: > > During boot of large memory machines, a significant portion of boot > > is spent initializing the struct page array. The vast majority of > > those pages are not referenced during boot. > > > > Change this over to only initializing the pages when they are > > actually allocated. 
> > > > Besides the advantage of boot speed, this allows us the chance to > > use normal performance monitoring tools to determine where the bulk > > of time is spent during page initialization. > > > > Signed-off-by: Robin Holt > > Signed-off-by: Nate Zimmer > > To: "H. Peter Anvin" > > To: Ingo Molnar > > Cc: Linux Kernel > > Cc: Linux MM > > Cc: Rob Landley > > Cc: Mike Travis > > Cc: Daniel J Blueman > > Cc: Andrew Morton > > Cc: Greg KH > > Cc: Yinghai Lu > > Cc: Mel Gorman > > --- > > include/linux/mm.h | 11 +++++ > > include/linux/page-flags.h | 5 +- > > mm/nobootmem.c | 5 ++ > > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- > > 4 files changed, 132 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index e0c8528..3de08b5 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > > __free_page(page); > > } > > > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > + > > +static inline void __reserve_bootmem_page(struct page *page) > > +{ > > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > > + phys_addr_t end = start + PAGE_SIZE; > > + > > + __reserve_bootmem_region(start, end); > > +} > > + > > static inline void free_reserved_page(struct page *page) > > { > > + __reserve_bootmem_page(page); > > __free_reserved_page(page); > > adjust_managed_page_count(page, 1); > > } > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 6d53675..79e8eb7 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -83,6 +83,7 @@ enum pageflags { > > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > > PG_arch_1, > > PG_reserved, > > + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ > > PG_private, /* If pagecache, has fs-private data */ > > PG_private_2, /* If pagecache, has fs aux data */ > > PG_writeback, /* Page is under writeback */ > > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) > > > > __PAGEFLAG(SlobFree, slob_free) > > > > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) > > + > > /* > > * Private page markings that may be used by the filesystem that owns the page > > * for its own purposes. > > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) > > #define PAGE_FLAGS_CHECK_AT_FREE \ > > (1 << PG_lru | 1 << PG_locked | \ > > 1 << PG_private | 1 << PG_private_2 | \ > > - 1 << PG_writeback | 1 << PG_reserved | \ > > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ > > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ > > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ > > __PG_COMPOUND_LOCK) > > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > > index 3b512ca..e3a386d 100644 > > --- a/mm/nobootmem.c > > +++ b/mm/nobootmem.c > > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) > > phys_addr_t start, end, size; > > u64 i; > > > > + for_each_reserved_mem_region(i, &start, &end) > > + __reserve_bootmem_region(start, end); > > + > > How about holes that is not in memblock.reserved? > > before this patch: > free_area_init_node/free_area_init_core/memmap_init_zone > will mark all page in node range to Reserved in struct page, at first. > > but those holes is not mapped via kernel low mapping. > so it should be ok not touch "struct page" for them. > > Now you only mark reserved for memblock.reserved at first, and later > mark {memblock.memory} - { memblock.reserved} to be available. > And that is ok. > > but should split that change to another patch and add some comment > and change log for the change. > in case there is some user like UEFI etc do weird thing. 
Nate and I talked this over today. Sorry for the delay, but it was the first time we were both free. Neither of us quite understands what you are asking for here. My interpretation is that you would like us to change the use of the PageReserved flag such that during boot, we do not set the flag at all from memmap_init_zone, and then only set it on pages which, at the time of free_all_bootmem, have been allocated/reserved in the memblock allocator. Is that correct? I will start to work that up on the assumption that is what you are asking for. Robin > > > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > > count += __free_memory_core(start, end); > > > > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) > > { > > struct pglist_data *pgdat; > > > > + memblock_dump_all(); > > + > > Not needed. > > > for_each_online_pgdat(pgdat) > > reset_node_lowmem_managed_pages(pgdat); > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 635b131..fe51eb9 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > > #endif > > } > > > > +static void expand_page_initialization(struct page *basepage) > > +{ > > + unsigned long pfn = page_to_pfn(basepage); > > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > > + unsigned long zone = page_zonenum(basepage); > > + int reserved = PageReserved(basepage); > > + int nid = page_to_nid(basepage); > > + > > + ClearPageUninitialized2Mib(basepage); > > + > > + for( pfn++; pfn < end_pfn; pfn++ ) > > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > > +} > > + > > +void ensure_pages_are_initialized(unsigned long start_pfn, > > + unsigned long end_pfn) > > +{ > > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > > + unsigned long aligned_end_pfn; > > + struct page *page; > > + > > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > > + aligned_end_pfn += PTRS_PER_PMD; > > + 
while (aligned_start_pfn < aligned_end_pfn) { > > + if (pfn_valid(aligned_start_pfn)) { > > + page = pfn_to_page(aligned_start_pfn); > > + > > + if(PageUninitialized2Mib(page)) > > + expand_page_initialization(page); > > + } > > + > > + aligned_start_pfn += PTRS_PER_PMD; > > + } > > +} > > + > > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > > +{ > > + unsigned long start_pfn = PFN_DOWN(start); > > + unsigned long end_pfn = PFN_UP(end); > > + > > + ensure_pages_are_initialized(start_pfn, end_pfn); > > +} > > that name is confusing, actually it is setting to struct page to Reserved only. > maybe __reserve_pages_bootmem() to be aligned to free_pages_bootmem ? > > > + > > +static inline void ensure_page_is_initialized(struct page *page) > > +{ > > + __reserve_bootmem_page(page); > > +} > > how about use __reserve_page_bootmem directly and add comment in callers site ? > > > + > > static bool free_pages_prepare(struct page *page, unsigned int order) > > { > > int i; > > @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) > > if (PageAnon(page)) > > page->mapping = NULL; > > for (i = 0; i < (1 << order); i++) > > - bad += free_pages_check(page + i); > > + if (PageUninitialized2Mib(page + i)) > > + i += PTRS_PER_PMD - 1; > > + else > > + bad += free_pages_check(page + i); > > if (bad) > > return false; > > > > @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) > > unsigned int loop; > > > > prefetchw(page); > > - for (loop = 0; loop < nr_pages; loop++) { > > + for (loop = 0; loop < nr_pages; ) { > > struct page *p = &page[loop]; > > > > if (loop + 1 < nr_pages) > > prefetchw(p + 1); > > + > > + if ((PageUninitialized2Mib(p)) && > > + ((loop + PTRS_PER_PMD) > nr_pages)) > > + ensure_page_is_initialized(p); > > + > > __ClearPageReserved(p); > > set_page_count(p, 0); > > + if (PageUninitialized2Mib(p)) > > + loop += PTRS_PER_PMD; > > + else > > + loop += 1; > > } > > > 
> page_zone(page)->managed_pages += 1 << order; > > @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page, > > area--; > > high--; > > size >>= 1; > > + ensure_page_is_initialized(page); > > VM_BUG_ON(bad_range(zone, &page[size])); > > > > #ifdef CONFIG_DEBUG_PAGEALLOC > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > for (i = 0; i < (1 << order); i++) { > > struct page *p = page + i; > > + > > + if (PageUninitialized2Mib(p)) > > + expand_page_initialization(page); > > + > > if (unlikely(check_new_page(p))) > > return 1; > > } > > @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone, > > unsigned long order; > > int pages_moved = 0; > > > > + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page)); > > #ifndef CONFIG_HOLES_IN_ZONE > > /* > > * page_zone is not safe to call in this context when > > @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) > > for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) > > return 1; > > + > > + if (PageUninitialized2Mib(pfn_to_page(pfn))) > > + pfn += PTRS_PER_PMD; > > } > > return 0; > > } > > @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone) > > } > > } > > > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > > + unsigned long size, int nid) > why not use static ? it seems there is not outside user. 
> > +{ > > + unsigned long validate_end_pfn = pfn + size; > > + > > + if (pfn & (size - 1)) > > + return 1; > > + > > + if (pfn + size >= end_pfn) > > + return 1; > > + > > + while (pfn < validate_end_pfn) > > + { > > + if (!early_pfn_valid(pfn)) > > + return 1; > > + if (!early_pfn_in_nid(pfn, nid)) > > + return 1; > > + pfn++; > > + } > > + > > + return size; > > +} > > + > > /* > > * Initially all pages are reserved - free ones are freed > > * up by free_all_bootmem() once the early boot process is > > @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > > highest_memmap_pfn = end_pfn - 1; > > > > z = &NODE_DATA(nid)->node_zones[zone]; > > - for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > + for (pfn = start_pfn; pfn < end_pfn; ) { > > /* > > * There can be holes in boot-time mem_map[]s > > * handed to this function. They do not > > * exist on hotplugged memory. > > */ > > + int pfns = 1; > > if (context == MEMMAP_EARLY) { > > - if (!early_pfn_valid(pfn)) > > + if (!early_pfn_valid(pfn)) { > > + pfn++; > > continue; > > - if (!early_pfn_in_nid(pfn, nid)) > > + } > > + if (!early_pfn_in_nid(pfn, nid)) { > > + pfn++; > > continue; > > + } > > + > > + pfns = pfn_range_init_avail(pfn, end_pfn, > > + PTRS_PER_PMD, nid); > > } > > maybe could update memmap_init_zone() only iterate {memblock.memory} - > {memblock.reserved}, so you do need to check avail .... > > as memmap_init_zone do not need to handle holes to mark reserve for them. > > > + > > page = pfn_to_page(pfn); > > __init_single_page(page, zone, nid, 1); > > + > > + if (pfns > 1) > > + SetPageUninitialized2Mib(page); > > + > > + pfn += pfns; > > } > > } > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > PG_uninitialized_2m ? 
> > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, > > Yinghai -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx138.postini.com [74.125.245.138]) by kanga.kvack.org (Postfix) with SMTP id 381356B0031 for ; Thu, 25 Jul 2013 08:50:58 -0400 (EDT) Received: by mail-ob0-f182.google.com with SMTP id wo10so1392316obc.27 for ; Thu, 25 Jul 2013 05:50:57 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20130725022543.GR3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130725022543.GR3421@sgi.com> Date: Thu, 25 Jul 2013 05:50:57 -0700 Message-ID: Subject: Re: [RFC 4/4] Sparse initialization of struct page array. From: Yinghai Lu Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Robin Holt Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt wrote: >> >> How about holes that is not in memblock.reserved? >> >> before this patch: >> free_area_init_node/free_area_init_core/memmap_init_zone >> will mark all page in node range to Reserved in struct page, at first. >> >> but those holes is not mapped via kernel low mapping. >> so it should be ok not touch "struct page" for them. >> >> Now you only mark reserved for memblock.reserved at first, and later >> mark {memblock.memory} - { memblock.reserved} to be available. >> And that is ok. >> >> but should split that change to another patch and add some comment >> and change log for the change. >> in case there is some user like UEFI etc do weird thing. 
> > Nate and I talked this over today. Sorry for the delay, but it was the > first time we were both free. Neither of us quite understands what you > are asking for here. My interpretation is that you would like us to > change the use of the PageReserved flag such that during boot, we do not > set the flag at all from memmap_init_zone, and then only set it on pages > which, at the time of free_all_bootmem, have been allocated/reserved in > the memblock allocator. Is that correct? I will start to work that up > on the assumption that is what you are asking for. Not exactly. your change should be right, but there is some subtle change about holes handling. before mem holes between memory ranges in memblock.memory, get struct page, and initialized with to Reserved in memmap_init_zone. Those holes is not in memory.reserved, with your patches, those hole's struct page will still have all 0. Please separate change about set page to reserved according to memory.reserved to another patch. Thanks Yinghai -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx106.postini.com [74.125.245.106]) by kanga.kvack.org (Postfix) with SMTP id 19A5F6B0031 for ; Thu, 25 Jul 2013 09:42:29 -0400 (EDT) Date: Thu, 25 Jul 2013 08:42:27 -0500 From: Robin Holt Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130725134227.GT3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130725022543.GR3421@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Yinghai Lu Cc: Robin Holt , "H. 
Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Thu, Jul 25, 2013 at 05:50:57AM -0700, Yinghai Lu wrote: > On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt wrote: > >> > >> How about holes that is not in memblock.reserved? > >> > >> before this patch: > >> free_area_init_node/free_area_init_core/memmap_init_zone > >> will mark all page in node range to Reserved in struct page, at first. > >> > >> but those holes is not mapped via kernel low mapping. > >> so it should be ok not touch "struct page" for them. > >> > >> Now you only mark reserved for memblock.reserved at first, and later > >> mark {memblock.memory} - { memblock.reserved} to be available. > >> And that is ok. > >> > >> but should split that change to another patch and add some comment > >> and change log for the change. > >> in case there is some user like UEFI etc do weird thing. > > > > Nate and I talked this over today. Sorry for the delay, but it was the > > first time we were both free. Neither of us quite understands what you > > are asking for here. My interpretation is that you would like us to > > change the use of the PageReserved flag such that during boot, we do not > > set the flag at all from memmap_init_zone, and then only set it on pages > > which, at the time of free_all_bootmem, have been allocated/reserved in > > the memblock allocator. Is that correct? I will start to work that up > > on the assumption that is what you are asking for. > > Not exactly. > > your change should be right, but there is some subtle change about > holes handling. > > before mem holes between memory ranges in memblock.memory, get struct page, > and initialized with to Reserved in memmap_init_zone. > Those holes is not in memory.reserved, with your patches, those hole's > struct page > will still have all 0. > > Please separate change about set page to reserved according to memory.reserved > to another patch. 
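To make the hole case concrete, here is a small user-space model of the two orderings. It is a sketch under assumptions, not kernel code: model_boot(), struct pfn_range and the page_state values are hypothetical names, and a "hole" is taken to be any pfn covered by neither the memory list nor the reserved list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page states for the model; PAGE_UNTOUCHED stands for a
 * struct page that never received the Reserved mark. */
enum page_state { PAGE_UNTOUCHED = 0, PAGE_RESERVED, PAGE_FREE };

struct pfn_range { size_t start, end; };   /* half-open: [start, end) */

static int in_ranges(size_t pfn, const struct pfn_range *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (pfn >= r[i].start && pfn < r[i].end)
            return 1;
    return 0;
}

/*
 * reserve_everything_first != 0 models the old ordering: every pfn in the
 * node range starts out Reserved, then memory minus reserved is freed.
 * reserve_everything_first == 0 models the new ordering: only pfns in the
 * reserved list are marked, then memory minus reserved is freed.
 */
static void model_boot(enum page_state *pages, size_t nr_pages,
                       const struct pfn_range *memory, size_t nr_memory,
                       const struct pfn_range *reserved, size_t nr_reserved,
                       int reserve_everything_first)
{
    assert(pages != NULL);

    for (size_t pfn = 0; pfn < nr_pages; pfn++) {
        pages[pfn] = reserve_everything_first ? PAGE_RESERVED
                                              : PAGE_UNTOUCHED;

        if (in_ranges(pfn, reserved, nr_reserved))
            pages[pfn] = PAGE_RESERVED;
        else if (in_ranges(pfn, memory, nr_memory))
            pages[pfn] = PAGE_FREE;
        /* else: a hole, which keeps whatever the first step gave it */
    }
}
```

In the old ordering a hole pfn ends up PAGE_RESERVED; in the new ordering it stays PAGE_UNTOUCHED, which is the behavior change being asked to go into its own patch.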
Just want to make sure this is where you want me to go. Here is my currently untested patch. Is that what you were expecting to have done? One thing I don't like about this patch is it seems to slow down boot in my simulator environment. I think I am going to look at restructuring things a bit to see if I can eliminate that performance penalty. Otherwise, I think I am following your directions. Thanks, Robin Holt From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx206.postini.com [74.125.245.206]) by kanga.kvack.org (Postfix) with SMTP id E96FC6B0031 for ; Thu, 25 Jul 2013 09:52:16 -0400 (EDT) Received: by mail-oa0-f42.google.com with SMTP id j6so4411318oag.15 for ; Thu, 25 Jul 2013 06:52:15 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20130725134227.GT3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130725022543.GR3421@sgi.com> <20130725134227.GT3421@sgi.com> Date: Thu, 25 Jul 2013 06:52:15 -0700 Message-ID: Subject: Re: [RFC 4/4] Sparse initialization of struct page array. From: Yinghai Lu Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Robin Holt Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman On Thu, Jul 25, 2013 at 6:42 AM, Robin Holt wrote: > On Thu, Jul 25, 2013 at 05:50:57AM -0700, Yinghai Lu wrote: >> On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt wrote: >> >> >> >> How about holes that is not in memblock.reserved? >> >> >> >> before this patch: >> >> free_area_init_node/free_area_init_core/memmap_init_zone >> >> will mark all page in node range to Reserved in struct page, at first. >> >> >> >> but those holes is not mapped via kernel low mapping. >> >> so it should be ok not touch "struct page" for them. 
>> >> >> >> Now you only mark reserved for memblock.reserved at first, and later >> >> mark {memblock.memory} - { memblock.reserved} to be available. >> >> And that is ok. >> >> >> >> but should split that change to another patch and add some comment >> >> and change log for the change. >> >> in case there is some user like UEFI etc do weird thing. >> > >> > Nate and I talked this over today. Sorry for the delay, but it was the >> > first time we were both free. Neither of us quite understands what you >> > are asking for here. My interpretation is that you would like us to >> > change the use of the PageReserved flag such that during boot, we do not >> > set the flag at all from memmap_init_zone, and then only set it on pages >> > which, at the time of free_all_bootmem, have been allocated/reserved in >> > the memblock allocator. Is that correct? I will start to work that up >> > on the assumption that is what you are asking for. >> >> Not exactly. >> >> your change should be right, but there is some subtle change about >> holes handling. >> >> before mem holes between memory ranges in memblock.memory, get struct page, >> and initialized with to Reserved in memmap_init_zone. >> Those holes is not in memory.reserved, with your patches, those hole's >> struct page >> will still have all 0. >> >> Please separate change about set page to reserved according to memory.reserved >> to another patch. > > > Just want to make sure this is where you want me to go. Here is my > currently untested patch. Is that what you were expecting to have done? > One thing I don't like about this patch is it seems to slow down boot in > my simulator environment. I think I am going to look at restructuring > things a bit to see if I can eliminate that performance penalty. > Otherwise, I think I am following your directions. 
> > From bdd2fefa59af18f283af6f066bc644ddfa5c7da8 Mon Sep 17 00:00:00 2001 > From: Robin Holt > Date: Thu, 25 Jul 2013 04:25:15 -0500 > Subject: [RFC -v2-pre2 4/5] ZZZ Only SegPageReserved() on memblock reserved > pages. yes. > > --- > include/linux/mm.h | 2 ++ > mm/nobootmem.c | 3 +++ > mm/page_alloc.c | 18 +++++++++++------- > 3 files changed, 16 insertions(+), 7 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e0c8528..b264a26 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) > totalram_pages += count; > } > > +extern void reserve_bootmem_region(unsigned long start, unsigned long end); > + > /* Free the reserved page into the buddy system, so it gets managed. */ > static inline void __free_reserved_page(struct page *page) > { > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index 2159e68..0840af2 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) > phys_addr_t start, end, size; > u64 i; > > + for_each_reserved_mem_region(i, &start, &end) > + reserve_bootmem_region(start, end); > + > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > count += __free_memory_core(start, end); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 048e166..3aa30b7 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -698,7 +698,7 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > } > > static void __init_single_page(unsigned long pfn, unsigned long zone, > - int nid, int reserved) > + int nid, int page_count) > { > struct page *page = pfn_to_page(pfn); > struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > @@ -708,12 +708,9 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, > init_page_count(page); > page_mapcount_reset(page); > page_nid_reset_last(page); > - if (reserved) { > - 
SetPageReserved(page); > - } else { > - ClearPageReserved(page); > - set_page_count(page, 0); > - } > + ClearPageReserved(page); > + set_page_count(page, page_count); > + > /* > * Mark the block movable so that blocks are reserved for > * movable at startup. This will force kernel allocations > @@ -741,6 +738,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, > #endif > } > > +void reserve_bootmem_region(unsigned long start, unsigned long end) > +{ > + for (; start < end; start++) > + if (pfn_valid(start)) > + SetPageReserved(pfn_to_page(start)); > +} > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > -- > 1.8.2.1 > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx187.postini.com [74.125.245.187]) by kanga.kvack.org (Postfix) with SMTP id 3ADC96B0032 for ; Fri, 2 Aug 2013 13:44:41 -0400 (EDT) From: Nathan Zimmer Subject: [RFC v2 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Date: Fri, 2 Aug 2013 12:44:22 -0500 Message-Id: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de We are still restricting ourselves to 2MiB initialization to keep the patch set a little smaller and clearer. We are still struggling with expand(). 
We nearly always see a first reference to a struct page which is in the middle of the 2MiB region. We were unable to find a good solution. Also, given the strong warning at the head of expand(), we did not feel experienced enough to refactor it to make things always reference the 2MiB page first. The only other fastpath impact left is the expansion in prep_new_page. With this patch, we did boot a 16TiB machine. The two main areas that benefit from this patch are free_all_bootmem and memmap_init_zone. Without the patches they took 407 seconds and 1151 seconds respectively. With the patches they took 220 and 49 seconds respectively. This is a total savings of 1289 seconds (21 minutes). These times were acquired using a modified version of script which records the time in uSecs at the beginning of each line of output. The previous patch set was faster through free_all_bootmem but I wanted to include Yinghai's suggestion. Hopefully I didn't miss the mark too much with that patch and yes I do still need to optimize it. I know there are still some rough parts but I wanted to check in with the full patch set. Nathan Zimmer (1): Only set page reserved in the memblock region Robin Holt (4): memblock: Introduce a for_each_reserved_mem_region iterator. Have __free_pages_memory() free in larger chunks. Move page initialization into a separate function. Sparse initialization of struct page array. include/linux/memblock.h | 18 +++++ include/linux/mm.h | 2 + include/linux/page-flags.h | 5 +- mm/memblock.c | 32 ++++++++ mm/mm_init.c | 2 +- mm/nobootmem.c | 28 +++---- mm/page_alloc.c | 194 ++++++++++++++++++++++++++++++++++++--------- 7 files changed, 225 insertions(+), 56 deletions(-) -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx132.postini.com [74.125.245.132]) by kanga.kvack.org (Postfix) with SMTP id 5D3CD6B0037 for ; Fri, 2 Aug 2013 13:44:41 -0400 (EDT) From: Nathan Zimmer Subject: [RFC v2 4/5] Only set page reserved in the memblock region Date: Fri, 2 Aug 2013 12:44:26 -0500 Message-Id: <1375465467-40488-5-git-send-email-nzimmer@sgi.com> In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Currently, each struct page is set as reserved when it is initialized. This patch changes that to start with the reserved bit clear and then set the bit only in the reserved region. I could restructure things a bit to eliminate the performance hit, but I wanted to make sure I am on track first. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/mm.h | 2 ++ mm/nobootmem.c | 3 +++ mm/page_alloc.c | 16 ++++++++++++---- 3 files changed, 17 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e0c8528..b264a26 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) totalram_pages += count; } +extern void reserve_bootmem_region(unsigned long start, unsigned long end); + /* Free the reserved page into the buddy system, so it gets managed. */ static inline void __free_reserved_page(struct page *page) { diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 2159e68..0840af2 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) phys_addr_t start, end, size; u64 i; + for_each_reserved_mem_region(i, &start, &end) + reserve_bootmem_region(start, end); + for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) count += __free_memory_core(start, end); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index df3ec13..382223e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } -static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) +static void __init_single_page(unsigned long pfn, unsigned long zone, + int nid, int page_count) { struct page *page = pfn_to_page(pfn); struct zone *z = &NODE_DATA(nid)->node_zones[zone]; set_page_links(page, zone, nid, pfn); mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); page_mapcount_reset(page); page_nid_reset_last(page); - SetPageReserved(page); + set_page_count(page, page_count); + ClearPageReserved(page); /* * Mark the block 
movable so that blocks are reserved for @@ -736,6 +737,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) #endif } +void reserve_bootmem_region(unsigned long start, unsigned long end) +{ + for (; start < end; start++) + if (pfn_valid(start)) + SetPageReserved(pfn_to_page(start)); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -4010,7 +4018,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - __init_single_page(pfn, zone, nid); + __init_single_page(pfn, zone, nid, 1); } } -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx153.postini.com [74.125.245.153]) by kanga.kvack.org (Postfix) with SMTP id 4F62E6B0033 for ; Fri, 2 Aug 2013 13:44:41 -0400 (EDT) From: Nathan Zimmer Subject: [RFC v2 1/5] memblock: Introduce a for_each_reserved_mem_region iterator. Date: Fri, 2 Aug 2013 12:44:23 -0500 Message-Id: <1375465467-40488-2-git-send-email-nzimmer@sgi.com> In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de From: Robin Holt As part of initializing struct page's in 2MiB chunks, we noticed that at the end of free_all_bootmem(), there was nothing which had forced the reserved/allocated 4KiB pages to be initialized. 
This helper function will be used for that expansion. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/memblock.h | 18 ++++++++++++++++++ mm/memblock.c | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index f388203..e99bbd1 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start, i != (u64)ULLONG_MAX; \ __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid)) +void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start, + phys_addr_t *out_end); + +/** + * for_earch_reserved_mem_region - iterate over all reserved memblock areas + * @i: u64 used as loop variable + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL + * + * Walks over reserved areas of memblock in. Available as soon as memblock + * is initialized. 
+ */ +#define for_each_reserved_mem_region(i, p_start, p_end) \ + for (i = 0UL, \ + __next_reserved_mem_region(&i, p_start, p_end); \ + i != (u64)ULLONG_MAX; \ + __next_reserved_mem_region(&i, p_start, p_end)) + #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid); diff --git a/mm/memblock.c b/mm/memblock.c index c5fad93..0d7d6e7 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size) } /** + * __next_reserved_mem_region - next function for for_each_reserved_region() + * @idx: pointer to u64 loop variable + * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL + * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL + * + * Iterate over all reserved memory regions. + */ +void __init_memblock __next_reserved_mem_region(u64 *idx, + phys_addr_t *out_start, + phys_addr_t *out_end) +{ + struct memblock_type *rsv = &memblock.reserved; + + if (*idx >= 0 && *idx < rsv->cnt) { + struct memblock_region *r = &rsv->regions[*idx]; + phys_addr_t base = r->base; + phys_addr_t size = r->size; + + if (out_start) + *out_start = base; + if (out_end) + *out_end = base + size - 1; + + *idx += 1; + return; + } + + /* signal end of iteration */ + *idx = ULLONG_MAX; +} + +/** * __next_free_mem_range - next function for for_each_free_mem_range() * @idx: pointer to u64 loop variable * @nid: nid: node selector, %MAX_NUMNODES for all nodes -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
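The iteration protocol of the patch above (the helper advances the index itself and stores an all-ones sentinel when the regions run out) can be modeled in user space. This is only a sketch under stated assumptions: `toy_region`, `toy_reserved`, `toy_next_reserved`, `toy_for_each_reserved`, and `toy_reserved_bytes` are hypothetical names, and the region table is made-up data, not memblock state. As in the posted patch, the reported end is inclusive (`base + size - 1`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct memblock_region. */
struct toy_region {
	uint64_t base;
	uint64_t size;
};

/* Made-up reserved regions, for illustration only. */
static const struct toy_region toy_reserved[] = {
	{ 0x1000,  0x2000 },
	{ 0x10000, 0x1000 },
};
#define TOY_CNT (sizeof(toy_reserved) / sizeof(toy_reserved[0]))

/* Mirrors the shape of __next_reserved_mem_region(): report region *idx,
 * advance *idx, or store the sentinel once all regions have been seen. */
static void toy_next_reserved(uint64_t *idx, uint64_t *out_start,
			      uint64_t *out_end)
{
	assert(idx != NULL);

	if (*idx < TOY_CNT) {
		const struct toy_region *r = &toy_reserved[*idx];

		if (out_start)
			*out_start = r->base;
		if (out_end)
			*out_end = r->base + r->size - 1; /* inclusive end */
		*idx += 1;
		return;
	}

	/* signal end of iteration */
	*idx = UINT64_MAX;
}

#define toy_for_each_reserved(i, p_start, p_end)		\
	for (i = 0, toy_next_reserved(&i, p_start, p_end);	\
	     i != UINT64_MAX;					\
	     toy_next_reserved(&i, p_start, p_end))

/* Example consumer: total number of reserved bytes. */
static uint64_t toy_reserved_bytes(void)
{
	uint64_t i, start, end, total = 0;

	toy_for_each_reserved(i, &start, &end)
		total += end - start + 1; /* +1 because end is inclusive */

	return total;
}
```

Because the end is inclusive in this model, a caller that treats it as an exclusive bound would skip the last byte of each region.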
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx195.postini.com [74.125.245.195]) by kanga.kvack.org (Postfix) with SMTP id 52B206B0036 for ; Fri, 2 Aug 2013 13:44:41 -0400 (EDT) From: Nathan Zimmer Subject: [RFC v2 3/5] Move page initialization into a separate function. Date: Fri, 2 Aug 2013 12:44:25 -0500 Message-Id: <1375465467-40488-4-git-send-email-nzimmer@sgi.com> In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de From: Robin Holt Currently, memmap_init_zone() has all the smarts for initializing a single page. When we convert to initializing pages in a 2MiB chunk, we will need to do this equivalent work from two separate places so we are breaking out a helper function. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- mm/mm_init.c | 2 +- mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++-------------------------- 2 files changed, 41 insertions(+), 34 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index c280a02..be8a539 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, +void mminit_verify_page_links(struct page *page, enum zone_type zone, unsigned long nid, unsigned long pfn) { BUG_ON(page_to_nid(page) != nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5adf81e..df3ec13 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,6 +697,45 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } +static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) +{ + struct page *page = pfn_to_page(pfn); + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_nid_reset_last(page); + SetPageReserved(page); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. 
+ */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. */ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -3951,7 +3990,6 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { - struct page *page; unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; @@ -3972,38 +4010,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_nid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_page(pfn, zone, nid); } } -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx136.postini.com [74.125.245.136]) by kanga.kvack.org (Postfix) with SMTP id 519B96B0034 for ; Fri, 2 Aug 2013 13:44:41 -0400 (EDT) From: Nathan Zimmer Subject: [RFC v2 2/5] Have __free_pages_memory() free in larger chunks. Date: Fri, 2 Aug 2013 12:44:24 -0500 Message-Id: <1375465467-40488-3-git-send-email-nzimmer@sgi.com> In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de From: Robin Holt Currently, when free_all_bootmem() calls __free_pages_memory(), the number of contiguous pages that __free_pages_memory() passes to the buddy allocator is limited to BITS_PER_LONG. In order to be able to free only the first page of a 2MiB chunk, we need that to be increased. We are increasing to the maximum size available. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- mm/nobootmem.c | 25 ++++++++----------------- 1 file changed, 8 insertions(+), 17 deletions(-) diff --git a/mm/nobootmem.c b/mm/nobootmem.c index bdd3fa2..2159e68 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -82,27 +82,18 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) static void __init __free_pages_memory(unsigned long start, unsigned long end) { - unsigned long i, start_aligned, end_aligned; - int order = ilog2(BITS_PER_LONG); + int order; - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1); - end_aligned = end & ~(BITS_PER_LONG - 1); + while (start < end) { + order = min(MAX_ORDER - 1, __ffs(start)); - if (end_aligned <= start_aligned) { - for (i = start; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + while (start + (1UL << order) > end) + order--; - return; - } - - for (i = start; i < start_aligned; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + __free_pages_bootmem(pfn_to_page(start), order); - for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG) - __free_pages_bootmem(pfn_to_page(i), order); - - for (i = end_aligned; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + start += (1UL << order); + } } static unsigned long __init __free_memory_core(phys_addr_t start, -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx135.postini.com [74.125.245.135]) by kanga.kvack.org (Postfix) with SMTP id 14ED86B0037 for ; Fri, 2 Aug 2013 13:44:41 -0400 (EDT) From: Nathan Zimmer Subject: [RFC v2 5/5] Sparse initialization of struct page array. 
Date: Fri, 2 Aug 2013 12:44:27 -0500 Message-Id: <1375465467-40488-6-git-send-email-nzimmer@sgi.com> In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de From: Robin Holt During boot of large memory machines, a significant portion of boot is spent initializing the struct page array. The vast majority of those pages are not referenced during boot. Change this over to only initializing the pages when they are actually allocated. Besides the advantage of boot speed, this allows us the chance to use normal performance monitoring tools to determine where the bulk of time is spent during page initialization. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/page-flags.h | 5 +- mm/page_alloc.c | 120 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 119 insertions(+), 6 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6d53675..d592065 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -83,6 +83,7 @@ enum pageflags { PG_owner_priv_1, /* Owner use. 
If pagecache, fs may use*/ PG_arch_1, PG_reserved, + PG_uninitialized_2m, PG_private, /* If pagecache, has fs-private data */ PG_private_2, /* If pagecache, has fs aux data */ PG_writeback, /* Page is under writeback */ @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) __PAGEFLAG(SlobFree, slob_free) +PAGEFLAG(Uninitialized2m, uninitialized_2m) + /* * Private page markings that may be used by the filesystem that owns the page * for its own purposes. @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) #define PAGE_FLAGS_CHECK_AT_FREE \ (1 << PG_lru | 1 << PG_locked | \ 1 << PG_private | 1 << PG_private_2 | \ - 1 << PG_writeback | 1 << PG_reserved | \ + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized_2m | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ __PG_COMPOUND_LOCK) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 382223e..c2fd03a0c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -737,8 +737,53 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, #endif } +static void expand_page_initialization(struct page *basepage) +{ + unsigned long pfn = page_to_pfn(basepage); + unsigned long end_pfn = pfn + PTRS_PER_PMD; + unsigned long zone = page_zonenum(basepage); + int count = page_count(basepage); + int nid = page_to_nid(basepage); + + ClearPageUninitialized2m(basepage); + + for (pfn++; pfn < end_pfn; pfn++) + __init_single_page(pfn, zone, nid, count); +} + +static void ensure_pages_are_initialized(unsigned long start_pfn, + unsigned long end_pfn) +{ + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); + unsigned long aligned_end_pfn; + struct page *page; + + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); + aligned_end_pfn += PTRS_PER_PMD; + while (aligned_start_pfn < aligned_end_pfn) { + if (pfn_valid(aligned_start_pfn)) { + page = pfn_to_page(aligned_start_pfn); + + if 
(PageUninitialized2m(page)) + expand_page_initialization(page); + } + + aligned_start_pfn += PTRS_PER_PMD; + } +} + +static inline void ensure_page_is_initialized(struct page *page) +{ + ensure_pages_are_initialized(page_to_pfn(page), page_to_pfn(page)); +} + void reserve_bootmem_region(unsigned long start, unsigned long end) { + unsigned long start_pfn = PFN_DOWN(start); + unsigned long end_pfn = PFN_UP(end); + + ensure_pages_are_initialized(start_pfn, end_pfn); + for (; start < end; start++) if (pfn_valid(start)) SetPageReserved(pfn_to_page(start)); @@ -755,7 +800,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) if (PageAnon(page)) page->mapping = NULL; for (i = 0; i < (1 << order); i++) - bad += free_pages_check(page + i); + if (PageUninitialized2m(page + i)) + i += PTRS_PER_PMD - 1; + else + bad += free_pages_check(page + i); if (bad) return false; @@ -799,13 +847,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) unsigned int loop; prefetchw(page); - for (loop = 0; loop < nr_pages; loop++) { + for (loop = 0; loop < nr_pages; ) { struct page *p = &page[loop]; if (loop + 1 < nr_pages) prefetchw(p + 1); + + if ((PageUninitialized2m(p)) && + ((loop + PTRS_PER_PMD) > nr_pages)) + ensure_page_is_initialized(p); + __ClearPageReserved(p); set_page_count(p, 0); + if (PageUninitialized2m(p)) + loop += PTRS_PER_PMD; + else + loop += 1; } page_zone(page)->managed_pages += 1 << order; @@ -860,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page, area--; high--; size >>= 1; + ensure_page_is_initialized(page); VM_BUG_ON(bad_range(zone, &page[size])); #ifdef CONFIG_DEBUG_PAGEALLOC @@ -907,6 +965,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) for (i = 0; i < (1 << order); i++) { struct page *p = page + i; + + if (PageUninitialized2m(p)) + expand_page_initialization(page); + if (unlikely(check_new_page(p))) return 1; } @@ -989,6 +1051,8 @@ int move_freepages(struct zone 
*zone, unsigned long order; int pages_moved = 0; + ensure_pages_are_initialized(page_to_pfn(start_page), + page_to_pfn(end_page)); #ifndef CONFIG_HOLES_IN_ZONE /* * page_zone is not safe to call in this context when @@ -3902,6 +3966,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) for (pfn = start_pfn; pfn < end_pfn; pfn++) { if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) return 1; + + if (PageUninitialized2m(pfn_to_page(pfn))) + pfn += PTRS_PER_PMD; } return 0; } @@ -3991,6 +4058,34 @@ static void setup_zone_migrate_reserve(struct zone *zone) } /* + * This function tells us if we have many pfns we have available. + * Available meaning valid and on the specified node. + * It return either size if that many pfns are available, 1 otherwise + */ +static int __meminit pfn_range_init_avail(unsigned long pfn, + unsigned long end_pfn, + unsigned long size, int nid) +{ + unsigned long validate_end_pfn = pfn + size; + + if (pfn & (size - 1)) + return 1; + + if (pfn + size >= end_pfn) + return 1; + + while (pfn < validate_end_pfn) { + if (!early_pfn_valid(pfn)) + return 1; + if (!early_pfn_in_nid(pfn, nid)) + return 1; + pfn++; + } + + return size; +} + +/* * Initially all pages are reserved - free ones are freed * up by free_all_bootmem() once the early boot process is * done. Non-atomic initialization, single-pass. @@ -4006,19 +4101,33 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, highest_memmap_pfn = end_pfn - 1; z = &NODE_DATA(nid)->node_zones[zone]; - for (pfn = start_pfn; pfn < end_pfn; pfn++) { + for (pfn = start_pfn; pfn < end_pfn; ) { /* * There can be holes in boot-time mem_map[]s * handed to this function. They do not * exist on hotplugged memory. 
*/ + int pfns = 1; if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn++; continue; - if (!early_pfn_in_nid(pfn, nid)) + } + if (!early_pfn_in_nid(pfn, nid)) { + pfn++; continue; + } + + pfns = pfn_range_init_avail(pfn, end_pfn, + PTRS_PER_PMD, nid); } + __init_single_page(pfn, zone, nid, 1); + + if (pfns > 1) + SetPageUninitialized2m(pfn_to_page(pfn)); + + pfn += pfns; } } @@ -6237,6 +6346,7 @@ static const struct trace_print_flags pageflag_names[] = { {1UL << PG_owner_priv_1, "owner_priv_1" }, {1UL << PG_arch_1, "arch_1" }, {1UL << PG_reserved, "reserved" }, + {1UL << PG_uninitialized_2m, "uninitialized_2m" }, {1UL << PG_private, "private" }, {1UL << PG_private_2, "private_2" }, {1UL << PG_writeback, "writeback" }, -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx172.postini.com [74.125.245.172]) by kanga.kvack.org (Postfix) with SMTP id 0E7466B0031 for ; Sat, 3 Aug 2013 16:04:56 -0400 (EDT) Date: Sat, 3 Aug 2013 15:04:54 -0500 From: Nathan Zimmer Subject: Re: [RFC v2 4/5] Only set page reserved in the memblock region Message-ID: <20130803200453.GA185972@asylum.americas.sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1375465467-40488-5-git-send-email-nzimmer@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1375465467-40488-5-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: Nathan Zimmer Cc: hpa@zytor.com, mingo@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, 
mgorman@suse.de On Fri, Aug 02, 2013 at 12:44:26PM -0500, Nathan Zimmer wrote: > Currently we when we initialze each page struct is set as reserved upon > initialization. This changes to starting with the reserved bit clear and > then only setting the bit in the reserved region. > > I could restruture a bit to eliminate the perform hit. But I wanted to make > sure I am on track first. > > Signed-off-by: Robin Holt > Signed-off-by: Nathan Zimmer > To: "H. Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > --- > include/linux/mm.h | 2 ++ > mm/nobootmem.c | 3 +++ > mm/page_alloc.c | 16 ++++++++++++---- > 3 files changed, 17 insertions(+), 4 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e0c8528..b264a26 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) > totalram_pages += count; > } > > +extern void reserve_bootmem_region(unsigned long start, unsigned long end); > + > /* Free the reserved page into the buddy system, so it gets managed. 
*/ > static inline void __free_reserved_page(struct page *page) > { > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index 2159e68..0840af2 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) > phys_addr_t start, end, size; > u64 i; > > + for_each_reserved_mem_region(i, &start, &end) > + reserve_bootmem_region(start, end); > + > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > count += __free_memory_core(start, end); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index df3ec13..382223e 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > spin_unlock(&zone->lock); > } > > -static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) > +static void __init_single_page(unsigned long pfn, unsigned long zone, > + int nid, int page_count) > { > struct page *page = pfn_to_page(pfn); > struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > > set_page_links(page, zone, nid, pfn); > mminit_verify_page_links(page, zone, nid, pfn); > - init_page_count(page); > page_mapcount_reset(page); > page_nid_reset_last(page); > - SetPageReserved(page); > + set_page_count(page, page_count); > + ClearPageReserved(page); > > /* > * Mark the block movable so that blocks are reserved for > @@ -736,6 +737,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) > #endif > } > > +void reserve_bootmem_region(unsigned long start, unsigned long end) > +{ > + for (; start < end; start++) > + if (pfn_valid(start)) > + SetPageReserved(pfn_to_page(start)); > +} > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > @@ -4010,7 +4018,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > if (!early_pfn_in_nid(pfn, nid)) > continue; > } > - __init_single_page(pfn, zone, nid); > + 
__init_single_page(pfn, zone, nid, 1); > } > } > > -- > 1.8.2.1 >

Actually, I believe reserve_bootmem_region is wrong. I am passing in phys_addr_t values and not pfns. It should be:

void reserve_bootmem_region(unsigned long start, unsigned long end)
{
	unsigned long start_pfn = PFN_DOWN(start);
	unsigned long end_pfn = PFN_UP(end);

	for (; start_pfn < end_pfn; start_pfn++)
		if (pfn_valid(start_pfn))
			SetPageReserved(pfn_to_page(start_pfn));
}

That also brings the timings back in line with the previous patch set.

Nate

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx152.postini.com [74.125.245.152]) by kanga.kvack.org (Postfix) with SMTP id 1CAC16B0034 for ; Mon, 5 Aug 2013 05:58:18 -0400 (EDT) Received: by mail-bk0-f41.google.com with SMTP id jc10so895880bkc.14 for ; Mon, 05 Aug 2013 02:58:16 -0700 (PDT) Date: Mon, 5 Aug 2013 11:58:12 +0200 From: Ingo Molnar Subject: Re: [RFC v2 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130805095812.GA29404@gmail.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: Nathan Zimmer Cc: hpa@zytor.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de * Nathan Zimmer wrote: > We are still restricting ourselves to 2MiB initialization to > keep the patch set a little smaller and more clear. > > We are still struggling with the expand(). 
Nearly always the first > reference to a struct page which is in the middle of the 2MiB region. > We were unable to find a good solution. Also, given the strong warning > at the head of expand(), we did not feel experienced enough to refactor > it to make things always reference the 2MiB page first. The only other > fastpath impact left is the expansion in prep_new_page. I suppose it's about this chunk: @@ -860,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page, area--; high--; size >>= 1; + ensure_page_is_initialized(page); VM_BUG_ON(bad_range(zone, &page[size])); where ensure_page_is_initialized() does, in essence: + while (aligned_start_pfn < aligned_end_pfn) { + if (pfn_valid(aligned_start_pfn)) { + page = pfn_to_page(aligned_start_pfn); + + if (PageUninitialized2m(page)) + expand_page_initialization(page); + } + + aligned_start_pfn += PTRS_PER_PMD; + } where aligned_start_pfn is 2MB rounded down. which looks like an expensive loop to execute for a single page: there are 512 pages in a 2MB range, so on average this iterates 256 times, for every single page of allocation. Right? I might be missing something, but why not just represent the initialization state in 2MB chunks: it is either fully uninitialized, or fully initialized. If any page in the 'middle' gets allocated, all page heads have to get initialized. That should make the fast path test fairly cheap, basically just PageUninitialized2m(page) has to be tested - and that will fail in the post-initialization fastpath. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
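The per-chunk state suggested in the message above (a 2MB chunk is either fully uninitialized or fully initialized, so the fast path is a single flag test) can be modeled in user space. This is only a toy sketch, not kernel code: every `toy_*` name, `chunk_uninitialized`, and `slow_path_calls` are hypothetical, and the state lives in a separate array here rather than in a page flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_CHUNK 512UL	/* 2MiB / 4KiB, as in the discussion */
#define NR_CHUNKS       8UL	/* arbitrary size for the toy model */

/* Hypothetical stand-in for struct page. */
struct toy_page {
	bool initialized;
};

static struct toy_page toy_pages[NR_CHUNKS * PAGES_PER_CHUNK];
static bool chunk_uninitialized[NR_CHUNKS];
static unsigned long slow_path_calls;

/* Boot-time state in this model: nothing has been initialized yet. */
static void toy_mark_all_uninitialized(void)
{
	unsigned long i;

	for (i = 0; i < NR_CHUNKS * PAGES_PER_CHUNK; i++)
		toy_pages[i].initialized = false;
	for (i = 0; i < NR_CHUNKS; i++)
		chunk_uninitialized[i] = true;
	slow_path_calls = 0;
}

/* Slow path: initialize every page of one chunk, then clear its flag. */
static void toy_init_chunk(unsigned long chunk)
{
	unsigned long pfn = chunk * PAGES_PER_CHUNK;
	unsigned long end = pfn + PAGES_PER_CHUNK;

	for (; pfn < end; pfn++)
		toy_pages[pfn].initialized = true;

	chunk_uninitialized[chunk] = false;
	slow_path_calls++;
}

/* Fast path: one flag test per call, whichever page of the chunk is hit. */
static inline void toy_ensure_initialized(unsigned long pfn)
{
	unsigned long chunk = pfn / PAGES_PER_CHUNK;

	assert(chunk < NR_CHUNKS);
	if (chunk_uninitialized[chunk])
		toy_init_chunk(chunk);
}
```

In this model a first touch in the middle of a chunk initializes the whole chunk once, and every later touch of that chunk only pays for the flag test.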
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx117.postini.com [74.125.245.117]) by kanga.kvack.org (Postfix) with SMTP id 586CA6B0036 for ; Mon, 12 Aug 2013 17:54:44 -0400 (EDT) From: Nathan Zimmer Subject: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Date: Mon, 12 Aug 2013 16:54:35 -0500 Message-Id: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de

We are still restricting ourselves to 2MiB initialization. This was initially to keep the patch set a little smaller and clearer. However, given how well it is currently performing, I don't see how much better it could be with 2GiB chunks.

As for extra overhead: we incur an extra function call to ensure_page_is_initialized, but that is only really expensive when we find uninitialized pages; otherwise it is a flag check once every PTRS_PER_PMD pages.

To get a better feel for this we ran two quick tests. The first was simply timing some memhogs. This showed no measurable difference, so we made a more granular test. We spawn N threads and start a timer; each thread mallocs totalmem/N and then writes to its memory to induce page faults; then we stop the timer. In this case each thread had just under 4GB of RAM to fault in. This showed a measurable difference in page faulting. The baseline took an average of 2.68 seconds and the new version took an average of 2.75 seconds, which is 0.07s slower, or 2.6%. Are there some other tests I should run?

With this patch, we did boot a 16TiB machine.
The two main areas that benefit from this patch are free_all_bootmem and
memmap_init_zone. Without the patches they took 407 seconds and 1151
seconds respectively. With the patches they took 13 and 39 seconds
respectively. This is a total savings of 1506 seconds (25 minutes).
These times were acquired using a modified version of script that records
the time in microseconds at the beginning of each line of output.

Overall I am fairly happy with the patch set at the moment. It improves
boot times without noticeable runtime overhead.
I am, as always, open to suggestions.

v2: included Yinghai's suggestion to not set the reserved bit until later.

v3: Corrected my first attempt at moving the reserved bit.
    __expand_page_initialization should only be called by
    ensure_pages_are_initialized.

Nathan Zimmer (1):
  Only set page reserved in the memblock region

Robin Holt (4):
  memblock: Introduce a for_each_reserved_mem_region iterator.
  Have __free_pages_memory() free in larger chunks.
  Move page initialization into a separate function.
  Sparse initialization of struct page array.

 include/linux/memblock.h   |  18 +++++
 include/linux/mm.h         |   2 +
 include/linux/page-flags.h |   5 +-
 mm/memblock.c              |  32 ++++++++
 mm/mm_init.c               |   2 +-
 mm/nobootmem.c             |  28 +++----
 mm/page_alloc.c            | 198 ++++++++++++++++++++++++++++++++++++---------
 7 files changed, 229 insertions(+), 56 deletions(-)

-- 
1.8.2.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx163.postini.com [74.125.245.163]) by kanga.kvack.org (Postfix) with SMTP id 8496E6B0037 for ; Mon, 12 Aug 2013 17:54:44 -0400 (EDT)
From: Nathan Zimmer
Subject: [RFC v3 1/5] memblock: Introduce a for_each_reserved_mem_region iterator.
Date: Mon, 12 Aug 2013 16:54:36 -0500 Message-Id: <1376344480-156708-2-git-send-email-nzimmer@sgi.com> In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de From: Robin Holt As part of initializing struct page's in 2MiB chunks, we noticed that at the end of free_all_bootmem(), there was nothing which had forced the reserved/allocated 4KiB pages to be initialized. This helper function will be used for that expansion. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/memblock.h | 18 ++++++++++++++++++ mm/memblock.c | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index f388203..e99bbd1 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start, i != (u64)ULLONG_MAX; \ __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid)) +void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start, + phys_addr_t *out_end); + +/** + * for_earch_reserved_mem_region - iterate over all reserved memblock areas + * @i: u64 used as loop variable + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL + * + * Walks over reserved areas of memblock in. 
Available as soon as memblock + * is initialized. + */ +#define for_each_reserved_mem_region(i, p_start, p_end) \ + for (i = 0UL, \ + __next_reserved_mem_region(&i, p_start, p_end); \ + i != (u64)ULLONG_MAX; \ + __next_reserved_mem_region(&i, p_start, p_end)) + #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid); diff --git a/mm/memblock.c b/mm/memblock.c index c5fad93..0d7d6e7 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size) } /** + * __next_reserved_mem_region - next function for for_each_reserved_region() + * @idx: pointer to u64 loop variable + * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL + * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL + * + * Iterate over all reserved memory regions. + */ +void __init_memblock __next_reserved_mem_region(u64 *idx, + phys_addr_t *out_start, + phys_addr_t *out_end) +{ + struct memblock_type *rsv = &memblock.reserved; + + if (*idx >= 0 && *idx < rsv->cnt) { + struct memblock_region *r = &rsv->regions[*idx]; + phys_addr_t base = r->base; + phys_addr_t size = r->size; + + if (out_start) + *out_start = base; + if (out_end) + *out_end = base + size - 1; + + *idx += 1; + return; + } + + /* signal end of iteration */ + *idx = ULLONG_MAX; +} + +/** * __next_free_mem_range - next function for for_each_free_mem_range() * @idx: pointer to u64 loop variable * @nid: nid: node selector, %MAX_NUMNODES for all nodes -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
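
The iteration protocol used by the patch above (an index that the "next"
function advances, with an all-ones value as the end-of-iteration sentinel)
can be modelled in userspace as follows. This is only a sketch: the region
table, the names next_reserved_region, for_each_reserved_region and
total_reserved_bytes, and the use of uint64_t in place of phys_addr_t are
all assumptions made for illustration. As in the patch, the reported end
address is inclusive (base + size - 1). The patch's `*idx >= 0` test is left
out here because an unsigned index can never be negative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region {
	uint64_t base;
	uint64_t size;
};

/* Hypothetical stand-in for memblock.reserved. */
static const struct region reserved[] = {
	{ 0x1000,     0x2000   },
	{ 0x100000,   0x1000   },
	{ 0x40000000, 0x200000 },
};
static const uint64_t reserved_cnt = sizeof(reserved) / sizeof(reserved[0]);

/*
 * Report the region at *idx and advance *idx, or set *idx to UINT64_MAX
 * once every region has been reported.
 */
void next_reserved_region(uint64_t *idx, uint64_t *out_start,
			  uint64_t *out_end)
{
	assert(idx != NULL);

	if (*idx < reserved_cnt) {
		const struct region *r = &reserved[*idx];

		if (out_start)
			*out_start = r->base;
		if (out_end)
			*out_end = r->base + r->size - 1;	/* inclusive */
		*idx += 1;
		return;
	}

	/* signal end of iteration */
	*idx = UINT64_MAX;
}

/* Same shape as the macro in the patch: prime once, then step. */
#define for_each_reserved_region(i, p_start, p_end)			\
	for (i = 0, next_reserved_region(&i, p_start, p_end);		\
	     i != UINT64_MAX;						\
	     next_reserved_region(&i, p_start, p_end))

/* Example consumer: add up the sizes of all reserved regions. */
uint64_t total_reserved_bytes(void)
{
	uint64_t i, start, end, total = 0;

	for_each_reserved_region(i, &start, &end)
		total += end - start + 1;	/* end is inclusive */
	return total;
}
```

Because the first call is made in the loop initializer, the body sees region
0 while the index already holds 1; the loop ends only after a call finds the
index past the last region and stores the sentinel.
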
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx123.postini.com [74.125.245.123]) by kanga.kvack.org (Postfix) with SMTP id A487A6B0039 for ; Mon, 12 Aug 2013 17:54:44 -0400 (EDT) From: Nathan Zimmer Subject: [RFC v3 3/5] Move page initialization into a separate function. Date: Mon, 12 Aug 2013 16:54:38 -0500 Message-Id: <1376344480-156708-4-git-send-email-nzimmer@sgi.com> In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de From: Robin Holt Currently, memmap_init_zone() has all the smarts for initializing a single page. When we convert to initializing pages in a 2MiB chunk, we will need to do this equivalent work from two separate places so we are breaking out a helper function. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- mm/mm_init.c | 2 +- mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++-------------------------- 2 files changed, 41 insertions(+), 34 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index c280a02..be8a539 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, +void mminit_verify_page_links(struct page *page, enum zone_type zone, unsigned long nid, unsigned long pfn) { BUG_ON(page_to_nid(page) != nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5adf81e..df3ec13 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,6 +697,45 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } +static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) +{ + struct page *page = pfn_to_page(pfn); + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_nid_reset_last(page); + SetPageReserved(page); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. 
+ */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. */ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -3951,7 +3990,6 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { - struct page *page; unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; @@ -3972,38 +4010,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_nid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_page(pfn, zone, nid); } } -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx181.postini.com [74.125.245.181]) by kanga.kvack.org (Postfix) with SMTP id B18E86B003A for ; Mon, 12 Aug 2013 17:54:44 -0400 (EDT) From: Nathan Zimmer Subject: [RFC v3 4/5] Only set page reserved in the memblock region Date: Mon, 12 Aug 2013 16:54:39 -0500 Message-Id: <1376344480-156708-5-git-send-email-nzimmer@sgi.com> In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Currently each struct page is marked reserved when it is initialized. This patch changes that: pages start with the reserved bit clear, and the bit is then set only for pages in the reserved memblock regions. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/mm.h | 2 ++ mm/nobootmem.c | 3 +++ mm/page_alloc.c | 19 +++++++++++++++---- 3 files changed, 20 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e0c8528..b264a26 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) totalram_pages += count; } +extern void reserve_bootmem_region(unsigned long start, unsigned long end); + /* Free the reserved page into the buddy system, so it gets managed. */ static inline void __free_reserved_page(struct page *page) { diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 2159e68..0840af2 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) phys_addr_t start, end, size; u64 i; + for_each_reserved_mem_region(i, &start, &end) + reserve_bootmem_region(start, end); + for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) count += __free_memory_core(start, end); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index df3ec13..227bd39 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } -static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) +static void __init_single_page(unsigned long pfn, unsigned long zone, + int nid, int page_count) { struct page *page = pfn_to_page(pfn); struct zone *z = &NODE_DATA(nid)->node_zones[zone]; set_page_links(page, zone, nid, pfn); mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); page_mapcount_reset(page); page_nid_reset_last(page); - SetPageReserved(page); + set_page_count(page, page_count); + ClearPageReserved(page); /* * Mark the block 
movable so that blocks are reserved for @@ -736,6 +737,16 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) #endif } +void reserve_bootmem_region(unsigned long start, unsigned long end) +{ + unsigned long start_pfn = PFN_DOWN(start); + unsigned long end_pfn = PFN_UP(end); + + for (; start_pfn < end_pfn; start_pfn++) + if (pfn_valid(start_pfn)) + SetPageReserved(pfn_to_page(start_pfn)); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -4010,7 +4021,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - __init_single_page(pfn, zone, nid); + __init_single_page(pfn, zone, nid, 1); } } -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx140.postini.com [74.125.245.140]) by kanga.kvack.org (Postfix) with SMTP id 9D0F56B0038 for ; Mon, 12 Aug 2013 17:54:44 -0400 (EDT) From: Nathan Zimmer Subject: [RFC v3 2/5] Have __free_pages_memory() free in larger chunks. 
Date: Mon, 12 Aug 2013 16:54:37 -0500 Message-Id: <1376344480-156708-3-git-send-email-nzimmer@sgi.com> In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de From: Robin Holt Currently, when free_all_bootmem() calls __free_pages_memory(), the number of contiguous pages that __free_pages_memory() passes to the buddy allocator is limited to BITS_PER_LONG. In order to be able to free only the first page of a 2MiB chunk, we need that to be increased. We are increasing to the maximum size available. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- mm/nobootmem.c | 25 ++++++++----------------- 1 file changed, 8 insertions(+), 17 deletions(-) diff --git a/mm/nobootmem.c b/mm/nobootmem.c index bdd3fa2..2159e68 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -82,27 +82,18 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) static void __init __free_pages_memory(unsigned long start, unsigned long end) { - unsigned long i, start_aligned, end_aligned; - int order = ilog2(BITS_PER_LONG); + int order; - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1); - end_aligned = end & ~(BITS_PER_LONG - 1); + while (start < end) { + order = min(MAX_ORDER - 1, __ffs(start)); - if (end_aligned <= start_aligned) { - for (i = start; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + while (start + (1UL << order) > end) + order--; - return; - } - - for (i = start; i < start_aligned; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + __free_pages_bootmem(pfn_to_page(start), order); - for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG) - __free_pages_bootmem(pfn_to_page(i), order); - - for (i = end_aligned; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + start += (1UL << order); + } } static unsigned long __init __free_memory_core(phys_addr_t start, -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx127.postini.com [74.125.245.127]) by kanga.kvack.org (Postfix) with SMTP id E52BC6B003B for ; Mon, 12 Aug 2013 17:54:44 -0400 (EDT) From: Nathan Zimmer Subject: [RFC v3 5/5] Sparse initialization of struct page array. 
Date: Mon, 12 Aug 2013 16:54:40 -0500 Message-Id: <1376344480-156708-6-git-send-email-nzimmer@sgi.com> In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de From: Robin Holt During boot of large memory machines, a significant portion of boot is spent initializing the struct page array. The vast majority of those pages are not referenced during boot. Change this over to only initializing the pages when they are actually allocated. Besides the advantage of boot speed, this allows us the chance to use normal performance monitoring tools to determine where the bulk of time is spent during page initialization. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/page-flags.h | 5 +- mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 115 insertions(+), 6 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6d53675..d592065 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -83,6 +83,7 @@ enum pageflags { PG_owner_priv_1, /* Owner use. 
If pagecache, fs may use*/ PG_arch_1, PG_reserved, + PG_uninitialized_2m, PG_private, /* If pagecache, has fs-private data */ PG_private_2, /* If pagecache, has fs aux data */ PG_writeback, /* Page is under writeback */ @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) __PAGEFLAG(SlobFree, slob_free) +PAGEFLAG(Uninitialized2m, uninitialized_2m) + /* * Private page markings that may be used by the filesystem that owns the page * for its own purposes. @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) #define PAGE_FLAGS_CHECK_AT_FREE \ (1 << PG_lru | 1 << PG_locked | \ 1 << PG_private | 1 << PG_private_2 | \ - 1 << PG_writeback | 1 << PG_reserved | \ + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized_2m | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ __PG_COMPOUND_LOCK) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 227bd39..6c35a58 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -737,11 +737,53 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, #endif } +static void __expand_page_initialization(struct page *basepage) +{ + unsigned long pfn = page_to_pfn(basepage); + unsigned long end_pfn = pfn + PTRS_PER_PMD; + unsigned long zone = page_zonenum(basepage); + int count = page_count(basepage); + int nid = page_to_nid(basepage); + + ClearPageUninitialized2m(basepage); + + for (pfn++; pfn < end_pfn; pfn++) + __init_single_page(pfn, zone, nid, count); +} + +static void ensure_pages_are_initialized(unsigned long start_pfn, + unsigned long end_pfn) +{ + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); + unsigned long aligned_end_pfn; + struct page *page; + + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); + aligned_end_pfn += PTRS_PER_PMD; + while (aligned_start_pfn < aligned_end_pfn) { + if (pfn_valid(aligned_start_pfn)) { + page = pfn_to_page(aligned_start_pfn); + + if 
(PageUninitialized2m(page)) + __expand_page_initialization(page); + } + + aligned_start_pfn += PTRS_PER_PMD; + } +} + +static inline void ensure_page_is_initialized(struct page *page) +{ + ensure_pages_are_initialized(page_to_pfn(page), page_to_pfn(page)); +} + void reserve_bootmem_region(unsigned long start, unsigned long end) { unsigned long start_pfn = PFN_DOWN(start); unsigned long end_pfn = PFN_UP(end); + ensure_pages_are_initialized(start_pfn, end_pfn); + for (; start_pfn < end_pfn; start_pfn++) if (pfn_valid(start_pfn)) SetPageReserved(pfn_to_page(start_pfn)); @@ -758,7 +800,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) if (PageAnon(page)) page->mapping = NULL; for (i = 0; i < (1 << order); i++) - bad += free_pages_check(page + i); + if (PageUninitialized2m(page + i)) + i += PTRS_PER_PMD - 1; + else + bad += free_pages_check(page + i); if (bad) return false; @@ -802,13 +847,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) unsigned int loop; prefetchw(page); - for (loop = 0; loop < nr_pages; loop++) { + for (loop = 0; loop < nr_pages; ) { struct page *p = &page[loop]; if (loop + 1 < nr_pages) prefetchw(p + 1); + + if ((PageUninitialized2m(p)) && + ((loop + PTRS_PER_PMD) > nr_pages)) + ensure_page_is_initialized(p); + __ClearPageReserved(p); set_page_count(p, 0); + if (PageUninitialized2m(p)) + loop += PTRS_PER_PMD; + else + loop += 1; } page_zone(page)->managed_pages += 1 << order; @@ -863,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page, area--; high--; size >>= 1; + ensure_page_is_initialized(&page[size]); VM_BUG_ON(bad_range(zone, &page[size])); #ifdef CONFIG_DEBUG_PAGEALLOC @@ -908,8 +963,11 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) { int i; + ensure_pages_are_initialized(page_to_pfn(page), + page_to_pfn(page+(1<= end_pfn) + return 1; + + while (pfn < validate_end_pfn) { + if (!early_pfn_valid(pfn)) + return 1; + if 
(!early_pfn_in_nid(pfn, nid)) + return 1; + pfn++; + } + + return size; +} + +/* * Initially all pages are reserved - free ones are freed * up by free_all_bootmem() once the early boot process is * done. Non-atomic initialization, single-pass. @@ -4009,19 +4100,33 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, highest_memmap_pfn = end_pfn - 1; z = &NODE_DATA(nid)->node_zones[zone]; - for (pfn = start_pfn; pfn < end_pfn; pfn++) { + for (pfn = start_pfn; pfn < end_pfn; ) { /* * There can be holes in boot-time mem_map[]s * handed to this function. They do not * exist on hotplugged memory. */ + int pfns = 1; if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn++; continue; - if (!early_pfn_in_nid(pfn, nid)) + } + if (!early_pfn_in_nid(pfn, nid)) { + pfn++; continue; + } + + pfns = pfn_range_init_avail(pfn, end_pfn, + PTRS_PER_PMD, nid); } + __init_single_page(pfn, zone, nid, 1); + + if (pfns > 1) + SetPageUninitialized2m(pfn_to_page(pfn)); + + pfn += pfns; } } @@ -6240,6 +6345,7 @@ static const struct trace_print_flags pageflag_names[] = { {1UL << PG_owner_priv_1, "owner_priv_1" }, {1UL << PG_arch_1, "arch_1" }, {1UL << PG_reserved, "reserved" }, + {1UL << PG_uninitialized_2m, "uninitialized_2m" }, {1UL << PG_private, "private" }, {1UL << PG_private_2, "private_2" }, {1UL << PG_writeback, "writeback" }, -- 1.8.2.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
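
The deferred-initialization scheme in the patch above can be modelled in
userspace to see how the 2MiB rounding behaves. The sketch below is an
illustration under stated assumptions, not the kernel code: toy_page,
mem_map as a plain array, and the toy_* functions are hypothetical names,
PAGES_PER_BLOCK stands in for PTRS_PER_PMD (512 with 4KiB pages on x86-64),
and holes, zones and NUMA nodes are ignored. Only the first page of each
block is initialized at "boot" and carries the uninitialized_2m flag; the
rest of the block is filled in on first use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_BLOCK	512UL	/* stand-in for PTRS_PER_PMD */
#define NR_BLOCKS	8UL
#define NR_PAGES	(NR_BLOCKS * PAGES_PER_BLOCK)

struct toy_page {
	bool initialized;
	bool uninitialized_2m;	/* meaningful on a block's first page only */
};

static struct toy_page mem_map[NR_PAGES];
static unsigned long pages_initialized;

/* "Boot": initialize and flag only the first page of every block. */
void toy_memmap_init(void)
{
	pages_initialized = 0;
	for (unsigned long pfn = 0; pfn < NR_PAGES; pfn++) {
		mem_map[pfn].initialized = false;
		mem_map[pfn].uninitialized_2m = false;
	}
	for (unsigned long pfn = 0; pfn < NR_PAGES; pfn += PAGES_PER_BLOCK) {
		mem_map[pfn].initialized = true;
		mem_map[pfn].uninitialized_2m = true;
		pages_initialized++;
	}
}

/* Initialize the remaining pages of one block and clear its flag. */
static void toy_expand_block(unsigned long base_pfn)
{
	mem_map[base_pfn].uninitialized_2m = false;
	for (unsigned long pfn = base_pfn + 1;
	     pfn < base_pfn + PAGES_PER_BLOCK; pfn++) {
		mem_map[pfn].initialized = true;
		pages_initialized++;
	}
}

/*
 * Same rounding as ensure_pages_are_initialized() in the diff: start is
 * rounded down to a block boundary, end is rounded down and then bumped
 * by one block. Returns how many block heads were examined.
 */
unsigned long toy_ensure_initialized(unsigned long start_pfn,
				     unsigned long end_pfn)
{
	unsigned long pfn = start_pfn & ~(PAGES_PER_BLOCK - 1);
	unsigned long aligned_end =
		(end_pfn & ~(PAGES_PER_BLOCK - 1)) + PAGES_PER_BLOCK;
	unsigned long checked = 0;

	assert(start_pfn <= end_pfn);

	for (; pfn < aligned_end; pfn += PAGES_PER_BLOCK) {
		checked++;
		if (pfn < NR_PAGES && mem_map[pfn].uninitialized_2m)
			toy_expand_block(pfn);
	}
	return checked;
}

unsigned long toy_pages_initialized(void)
{
	return pages_initialized;
}
```

In this model a single-page request such as toy_ensure_initialized(700, 700)
examines exactly one block head, and a repeated call on an already expanded
block does no further initialization work beyond that one flag test.
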
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx173.postini.com [74.125.245.173]) by kanga.kvack.org (Postfix) with SMTP id 9EF096B0032 for ; Tue, 13 Aug 2013 06:58:51 -0400 (EDT) Received: by mail-ee0-f44.google.com with SMTP id b47so4086376eek.17 for ; Tue, 13 Aug 2013 03:58:50 -0700 (PDT) Date: Tue, 13 Aug 2013 12:58:47 +0200 From: Ingo Molnar Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130813105847.GC2170@gmail.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Sender: owner-linux-mm@kvack.org List-ID: To: Nathan Zimmer Cc: hpa@zytor.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de, Linus Torvalds * Nathan Zimmer wrote: > We are still restricting ourselves ourselves to 2MiB initialization. > This was initially to keep the patch set a little smaller and more > clear. However given how well it is currently performing I don't see a > how much better it could be with to 2GiB chunks. > > As far as extra overhead. We incur an extra function call to > ensure_page_is_initialized but that is only really expensive when we > find uninitialized pages, otherwise it is a flag check once every > PTRS_PER_PMD. [...] Mind expanding on this in more detail? The main fastpath overhead we are really interested in is the 'memory is already fully ininialized and we reallocate a second time' case - i.e. the *second* (and subsequent), post-initialization allocation of any page range. 
Those allocations are the ones that matter most: they will occur again and again, for the lifetime of the booted up system. What extra overhead is there in that case? Only a flag check that is merged into an existing flag check (in free_pages_check()) and thus is essentially zero overhead? Or is it more involved - if yes, why? One would naively think that nothing but the flags check is needed in this case: if all 512 pages in an aligned 2MB block is fully initialized, and marked as initialized in all the 512 page heads, then no other runtime check will be needed in the future. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx138.postini.com [74.125.245.138]) by kanga.kvack.org (Postfix) with SMTP id BD8716B0032 for ; Tue, 13 Aug 2013 13:09:35 -0400 (EDT) Received: by mail-vb0-f54.google.com with SMTP id q14so5963797vbe.13 for ; Tue, 13 Aug 2013 10:09:34 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Date: Tue, 13 Aug 2013 10:09:34 -0700 Message-ID: Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Nathan Zimmer Cc: Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman On Mon, Aug 12, 2013 at 2:54 PM, Nathan Zimmer wrote: > > As far as extra overhead. 
We incur an extra function call to > ensure_page_is_initialized but that is only really expensive when we find > uninitialized pages, otherwise it is a flag check once every PTRS_PER_PMD. > To get a better feel for this we ran two quick tests. Sorry for coming into this late and for this last version of the patch, but I have to say that I'd *much* rather see this delayed initialization using another data structure than hooking into the basic page allocation ones.. I understand that you want to do delayed initialization on some TB+ memory machines, but what I don't understand is why it has to be done when the pages have already been added to the memory management free list. Could we not do this much simpler: make the early boot insert the first few gigs of memory (initialized) synchronously into the free lists, and then have a background thread that goes through the rest? That way the MM layer would never see the uninitialized pages. And I bet that *nobody* cares if you "only" have a few gigs of ram during the first few minutes of boot, and you mysteriously end up getting more and more memory for a while until all the RAM has been initialized. IOW, just don't call __free_pages_bootmem() on all the pages al at once. If we have to remove a few __init markers to be able to do some of it later, does anybody really care? I really really dislike this "let's check if memory is initialized at runtime" approach. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx108.postini.com [74.125.245.108]) by kanga.kvack.org (Postfix) with SMTP id BECB36B0032 for ; Tue, 13 Aug 2013 13:24:17 -0400 (EDT) Message-ID: <520A6BA2.7060800@zytor.com> Date: Tue, 13 Aug 2013 10:23:46 -0700 From: "H. 
Peter Anvin" MIME-Version: 1.0 Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Nathan Zimmer , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman On 08/13/2013 10:09 AM, Linus Torvalds wrote: > > I really really dislike this "let's check if memory is initialized at > runtime" approach. > It does seem to be getting messy, doesn't it... The one potential serious concern if if that will end up mucking with NUMA affinity in a way that has lasting effects past boot. -hpa -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx155.postini.com [74.125.245.155]) by kanga.kvack.org (Postfix) with SMTP id 1F3E36B0032 for ; Tue, 13 Aug 2013 13:33:55 -0400 (EDT) Message-ID: <520A6DFC.1070201@sgi.com> Date: Tue, 13 Aug 2013 10:33:48 -0700 From: Mike Travis MIME-Version: 1.0 Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman On 8/13/2013 10:09 AM, Linus Torvalds wrote: > On Mon, Aug 12, 2013 at 2:54 PM, Nathan Zimmer wrote: >> >> As far as extra overhead. We incur an extra function call to >> ensure_page_is_initialized but that is only really expensive when we find >> uninitialized pages, otherwise it is a flag check once every PTRS_PER_PMD. >> To get a better feel for this we ran two quick tests. > > Sorry for coming into this late and for this last version of the > patch, but I have to say that I'd *much* rather see this delayed > initialization using another data structure than hooking into the > basic page allocation ones.. > > I understand that you want to do delayed initialization on some TB+ > memory machines, but what I don't understand is why it has to be done > when the pages have already been added to the memory management free > list. > > Could we not do this much simpler: make the early boot insert the > first few gigs of memory (initialized) synchronously into the free > lists, and then have a background thread that goes through the rest? 
> > That way the MM layer would never see the uninitialized pages. > > And I bet that *nobody* cares if you "only" have a few gigs of ram > during the first few minutes of boot, and you mysteriously end up > getting more and more memory for a while until all the RAM has been > initialized. > > IOW, just don't call __free_pages_bootmem() on all the pages all at > once. If we have to remove a few __init markers to be able to do some > of it later, does anybody really care? > > I really really dislike this "let's check if memory is initialized at > runtime" approach. > > Linus > Initially this patch set consisted of diverting a major portion of the memory to an "absent" list during e820 processing. A very late initcall was then used to dispatch a cpu per node to add that node's absent memory. By nature these ran in parallel so Nathan did the work to "parallelize" various global resource locks to become per node locks. This sped up insertion considerably. And by disabling the "auto-start" of the insertion process and using a manual start command, you could monitor the insertion process and find hot spots in the memory initialization code. There were also small updates to the sys/devices/{memory,node} drivers to display the amount of memory still "absent". -Mike -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx134.postini.com [74.125.245.134]) by kanga.kvack.org (Postfix) with SMTP id A12756B0032 for ; Tue, 13 Aug 2013 13:51:38 -0400 (EDT) Received: by mail-vb0-f42.google.com with SMTP id e12so6928514vbg.29 for ; Tue, 13 Aug 2013 10:51:37 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <520A6DFC.1070201@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> Date: Tue, 13 Aug 2013 10:51:37 -0700 Message-ID: Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Mike Travis Cc: Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman On Tue, Aug 13, 2013 at 10:33 AM, Mike Travis wrote: > > Initially this patch set consisted of diverting a major portion of the > memory to an "absent" list during e820 processing. A very late initcall > was then used to dispatch a cpu per node to add that node's absent > memory. By nature these ran in parallel so Nathan did the work to > "parallelize" various global resource locks to become per node locks. So quite frankly, I'm not sure how worthwhile it even is to parallelize the thing. I realize that some environments may care about getting up to full memory population very quickly, but I think it would be very rare and specialized, and shouldn't necessarily be part of the initial patches. And it really doesn't have to be an initcall at all - at least not a synchronous one. 
A late initcall to get the process *started*, but the process itself could easily be done with a separate thread asynchronously, and let the machine boot up while that thread is going. And in fact, I'd argue that instead of trying to make it fast and parallelize things excessively, you might want to make the memory initialization *slow*, and make all the rest of the bootup have higher priority. At that point, who cares if it takes 400 seconds to get all memory initialized? In fact, who cares if it takes twice that? Let's assume that the rest of the boot takes 30s (which is pretty aggressive for some big server with terabytes of memory), even if the memory initialization was running in the background and only during idle time for probing, I'm sure you'd have a few hundred gigs of RAM initialized by the time you can log in. And if it then takes another ten minutes until you have the full 16TB initialized, and some things might be a tad slower early on, does anybody really care? The machine will be up and running with plenty of memory, even if it may not be *all* the memory yet. I realize that benchmarking cares, and yes, I also realize that some benchmarks actually want to reboot the machine between some runs just to get repeatability, but if you're benchmarking a 16TB machine I'm guessing any serious benchmark that actually uses that much memory is going to take many hours to a few days to run anyway? Having some way to wait until the memory is all done (which might even be just a silly shell script that does "ps" and waits for the kernel threads to all go away) isn't going to kill the benchmark - and the benchmark itself will then not have to worry about hitting the "oops, I need to initialize 2GB of RAM now because I hit an uninitialized page". Ok, so I don't know all the issues, and in many ways I don't even really care. You could do it other ways, I don't think this is a big deal. 
The part I hate is the runtime hook into the core MM page allocation code, so I'm just throwing out any random thing that comes to my mind that could be used to avoid that part. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx143.postini.com [74.125.245.143]) by kanga.kvack.org (Postfix) with SMTP id 5DE8E6B0032 for ; Tue, 13 Aug 2013 14:04:07 -0400 (EDT) Message-ID: <520A7514.9020008@sgi.com> Date: Tue, 13 Aug 2013 11:04:04 -0700 From: Mike Travis MIME-Version: 1.0 Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman On 8/13/2013 10:51 AM, Linus Torvalds wrote: > by the time you can log in. And if it then takes another ten minutes > until you have the full 16TB initialized, and some things might be a > tad slower early on, does anybody really care? The machine will be up > and running with plenty of memory, even if it may not be *all* the > memory yet. Before the patches adding memory took ~45 mins for 16TB and almost 2 hours for 32TB. Adding it late sped up early boot but late insertion was still very slow, where the full 32TB was still not fully inserted after an hour. Doing it in parallel along with the memory hotplug lock per node, we got it down to the 10-15 minute range. 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx188.postini.com [74.125.245.188]) by kanga.kvack.org (Postfix) with SMTP id 872436B0032 for ; Tue, 13 Aug 2013 15:06:25 -0400 (EDT) Message-ID: <520A83B0.40607@sgi.com> Date: Tue, 13 Aug 2013 12:06:24 -0700 From: Mike Travis MIME-Version: 1.0 Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <520A7514.9020008@sgi.com> In-Reply-To: <520A7514.9020008@sgi.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman On 8/13/2013 11:04 AM, Mike Travis wrote: > > > On 8/13/2013 10:51 AM, Linus Torvalds wrote: >> by the time you can log in. And if it then takes another ten minutes >> until you have the full 16TB initialized, and some things might be a >> tad slower early on, does anybody really care? The machine will be up >> and running with plenty of memory, even if it may not be *all* the >> memory yet. > > Before the patches adding memory took ~45 mins for 16TB and almost 2 hours > for 32TB. Adding it late sped up early boot but late insertion was still > very slow, where the full 32TB was still not fully inserted after an hour. > Doing it in parallel along with the memory hotplug lock per node, we got > it down to the 10-15 minute range. > FYI, the system at this time had 128 nodes each with 256GB of memory. 
About 252GB was inserted into the absent list from nodes 1 .. 126. Memory on nodes 0 and 128 was left fully present. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id B8B2B6B0032 for ; Tue, 13 Aug 2013 16:24:32 -0400 (EDT) Received: by mail-ob0-f172.google.com with SMTP id er7so11086542obc.3 for ; Tue, 13 Aug 2013 13:24:31 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <520A83B0.40607@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <520A7514.9020008@sgi.com> <520A83B0.40607@sgi.com> Date: Tue, 13 Aug 2013 13:24:31 -0700 Message-ID: Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator From: Yinghai Lu Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Mike Travis Cc: Linus Torvalds , Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Mel Gorman On Tue, Aug 13, 2013 at 12:06 PM, Mike Travis wrote: > > > On 8/13/2013 11:04 AM, Mike Travis wrote: >> >> >> On 8/13/2013 10:51 AM, Linus Torvalds wrote: >>> by the time you can log in. And if it then takes another ten minutes >>> until you have the full 16TB initialized, and some things might be a >>> tad slower early on, does anybody really care? The machine will be up >>> and running with plenty of memory, even if it may not be *all* the >>> memory yet. >> >> Before the patches adding memory took ~45 mins for 16TB and almost 2 hours >> for 32TB. 
Adding it late sped up early boot but late insertion was still
>> very slow, where the full 32TB was still not fully inserted after an hour.
>> Doing it in parallel along with the memory hotplug lock per node, we got
>> it down to the 10-15 minute range.
>>
>
> FYI, the system at this time had 128 nodes each with 256GB of memory.
> About 252GB was inserted into the absent list from nodes 1 .. 126.
> Memory on nodes 0 and 128 was left fully present.

Can we have a topic on these boot-time issues at this year's kernel summit?

There will be more 32-socket x86 systems, and they will have lots of
memory, PCI chains and CPU cores.

In the current kernel/smp.c::smp_init(), we still have

| /* FIXME: This should be done in userspace --RR */
| for_each_present_cpu(cpu) {
|         if (num_online_cpus() >= setup_max_cpus)
|                 break;
|         if (!cpu_online(cpu))
|                 cpu_up(cpu);
| }

Possible solutions would be:
1. Delay bringing up some memory, PCI chains, or CPU cores.
2. Or initialize them in parallel during boot.
3. Or add them in parallel after boot.

Thanks

Yinghai

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx146.postini.com [74.125.245.146]) by kanga.kvack.org (Postfix) with SMTP id 843556B0032 for ; Tue, 13 Aug 2013 16:37:59 -0400 (EDT) Message-ID: <520A9924.7050301@sgi.com> Date: Tue, 13 Aug 2013 13:37:56 -0700 From: Mike Travis MIME-Version: 1.0 Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <520A7514.9020008@sgi.com> <520A83B0.40607@sgi.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Yinghai Lu Cc: Linus Torvalds , Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Mel Gorman On 8/13/2013 1:24 PM, Yinghai Lu wrote: >> > FYI, the system at this time had 128 nodes each with 256GB of memory. >> > About 252GB was inserted into the absent list from nodes 1 .. 126. >> > Memory on nodes 0 and 128 was left fully present. Actually, I was corrected, it was 256 nodes with 128GB (8 * 16GB dimms - which are just now coming out.) So there were 254 concurrent initialization processes running. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx162.postini.com [74.125.245.162]) by kanga.kvack.org (Postfix) with SMTP id 381296B0034 for ; Tue, 13 Aug 2013 17:35:05 -0400 (EDT) Message-ID: <520AA687.3070303@sgi.com> Date: Tue, 13 Aug 2013 16:35:03 -0500 From: Nathan Zimmer MIME-Version: 1.0 Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <520A7514.9020008@sgi.com> In-Reply-To: <520A7514.9020008@sgi.com> Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mike Travis Cc: Linus Torvalds , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman On 08/13/2013 01:04 PM, Mike Travis wrote: > > On 8/13/2013 10:51 AM, Linus Torvalds wrote: >> by the time you can log in. And if it then takes another ten minutes >> until you have the full 16TB initialized, and some things might be a >> tad slower early on, does anybody really care? The machine will be up >> and running with plenty of memory, even if it may not be *all* the >> memory yet. > Before the patches adding memory took ~45 mins for 16TB and almost 2 hours > for 32TB. Adding it late sped up early boot but late insertion was still > very slow, where the full 32TB was still not fully inserted after an hour. > Doing it in parallel along with the memory hotplug lock per node, we got > it down to the 10-15 minute range. Yes, but to get it to the 10-15 minute range I had to change a number of system locks: system_sleep, memory_hotplug and zonelist_mutex. There was also some general alteration to various wmark routines. 
Some of those fixes I don't know if they would stand up to proper scrutiny but were quick and dirty hacks to allow for progress. Nate -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx201.postini.com [74.125.245.201]) by kanga.kvack.org (Postfix) with SMTP id ABB0A6B0032 for ; Tue, 13 Aug 2013 19:10:22 -0400 (EDT) Date: Tue, 13 Aug 2013 18:10:20 -0500 From: Nathan Zimmer Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130813231020.GA22667@asylum.americas.sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Mike Travis , Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman On Tue, Aug 13, 2013 at 10:51:37AM -0700, Linus Torvalds wrote: > I realize that benchmarking cares, and yes, I also realize that some > benchmarks actually want to reboot the machine between some runs just > to get repeatability, but if you're benchmarking a 16TB machine I'm > guessing any serious benchmark that actually uses that much memory is > going to take many hours to a few days to run anyway? 
Having some way > to wait until the memory is all done (which might even be just a silly > shell script that does "ps" and waits for the kernel threads to all go > away) isn't going to kill the benchmark - and the benchmark itself > will then not have to worry about hitting the "oops, I need to > initialize 2GB of RAM now because I hit an uninitialized page". > I am not overly concerned with the cost of having to set up a page struct on first touch, but what I need to avoid is adding more permanent cost to page faults on a system that is already "primed". > Ok, so I don't know all the issues, and in many ways I don't even > really care. You could do it other ways, I don't think this is a big > deal. The part I hate is the runtime hook into the core MM page > allocation code, so I'm just throwing out any random thing that comes > to my mind that could be used to avoid that part. > The only mm structure we are adding to is a new flag in page->flags. That didn't seem too much. I had hoped to restrict the core mm changes to check_new_page and free_pages_check but I haven't gotten there yet. Not putting uninitialized pages onto the lru would work, but then I would be concerned over any calculations based on totalpages. I might be too paranoid there but having that be incorrect until after a system is booted worries me. Nate -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx155.postini.com [74.125.245.155]) by kanga.kvack.org (Postfix) with SMTP id EFE5C6B0032 for ; Tue, 13 Aug 2013 19:55:22 -0400 (EDT) Received: by mail-vc0-f171.google.com with SMTP id ij15so4565650vcb.16 for ; Tue, 13 Aug 2013 16:55:22 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20130813231020.GA22667@asylum.americas.sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <20130813231020.GA22667@asylum.americas.sgi.com> Date: Tue, 13 Aug 2013 16:55:21 -0700 Message-ID: Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator From: Linus Torvalds Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Nathan Zimmer Cc: Mike Travis , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman On Tue, Aug 13, 2013 at 4:10 PM, Nathan Zimmer wrote: > > The only mm structure we are adding to is a new flag in page->flags. > That didn't seem too much. I don't agree. I see only downsides, and no upsides. Doing the same thing *without* the downsides seems straightforward, so I simply see no reason for any extra flags or tests at runtime. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx106.postini.com [74.125.245.106]) by kanga.kvack.org (Postfix) with SMTP id 337EB6B0095 for ; Wed, 14 Aug 2013 07:06:05 -0400 (EDT) Received: by mail-ea0-f169.google.com with SMTP id z7so4798820eaf.28 for ; Wed, 14 Aug 2013 04:06:03 -0700 (PDT) Date: Wed, 14 Aug 2013 13:05:56 +0200 From: Ingo Molnar Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130814110556.GH10849@gmail.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Mike Travis , Nathan Zimmer , Peter Anvin , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman * Linus Torvalds wrote: > [...] > > Ok, so I don't know all the issues, and in many ways I don't even really > care. You could do it other ways, I don't think this is a big deal. The > part I hate is the runtime hook into the core MM page allocation code, > so I'm just throwing out any random thing that comes to my mind that > could be used to avoid that part. 
So, my hope was that it's possible to have a single, simple, zero-cost
runtime check [zero cost for already initialized pages], because it can be
merged into already existing page flag mask checks present here and
executed for every freshly allocated page:

static inline int check_new_page(struct page *page)
{
        if (unlikely(page_mapcount(page) |
                (page->mapping != NULL)  |
                (atomic_read(&page->_count) != 0)  |
                (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
                (mem_cgroup_bad_page_check(page)))) {
                bad_page(page);
                return 1;
        }
        return 0;
}

We already run this for every new page allocated and the initialization
check could hide in PAGE_FLAGS_CHECK_AT_PREP in a zero-cost fashion.

I'd not do any of the ensure_page_is_initialized() or
__expand_page_initialization() complications in this patch-set - each page
head represents itself and gets iterated when check_new_page() is done.

During regular bootup we'd initialize like before, except we don't set up
the page heads but memset() them to zero. With each page head 32 bytes
this would mean 8 GB of page head memory to clear per 1 TB - with 16 TB
that's 128 GB to clear - that ought to be possible to do rather quickly,
perhaps with some smart SMP cross-call approach that makes sure that each
memset is done in a node-local fashion. [*]

Such an approach should IMO be far smaller and less invasive than the
patches presented so far: it should be below 100 lines or so.

I don't know why there's such a big difference between the theory I
outlined and the invasive patch-set implemented so far in practice,
perhaps I'm missing some complication. I was trying to probe that
difference, before giving up on the idea and punting back to the async
hotplug-ish approach which would obviously work well too.

All in all, I think async init just hides the real problem - there's no way
memory init should take this long.

Thanks,

        Ingo

[*] alternatively maybe the main performance problem is that node-local memory is set up on a remote (boot) node? 
In that case I'd try to optimize it by migrating the memory init code's current node by using set_cpus_allowed() to live migrate from node to node, tracking the node whose struct page array is being initialized. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx125.postini.com [74.125.245.125]) by kanga.kvack.org (Postfix) with SMTP id 1777C6B0096 for ; Wed, 14 Aug 2013 07:27:46 -0400 (EDT) Received: by mail-ee0-f46.google.com with SMTP id c13so4769428eek.5 for ; Wed, 14 Aug 2013 04:27:44 -0700 (PDT) Date: Wed, 14 Aug 2013 13:27:41 +0200 From: Ingo Molnar Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130814112741.GB13772@gmail.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <20130813231020.GA22667@asylum.americas.sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: Nathan Zimmer , Mike Travis , Peter Anvin , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman * Linus Torvalds wrote: > On Tue, Aug 13, 2013 at 4:10 PM, Nathan Zimmer wrote: > > > > The only mm structure we are adding to is a new flag in page->flags. > > That didn't seem too much. > > I don't agree. > > I see only downsides, and no upsides. Doing the same thing *without* the > downsides seems straightforward, so I simply see no reason for any extra > flags or tests at runtime. 
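The alternative Linus describes earlier in the thread (put an initialized first slice of memory on the free lists at boot, then feed the rest in afterwards, so the allocator never sees an uninitialized page and needs no extra flag or test) can be sketched as a userspace toy. This is only an illustration under loud assumptions: `toy_page`, `boot_init()`, `deferred_init_step()` and `toy_alloc_page()` are hypothetical names, not kernel interfaces, and the "background thread" is modelled as a function called in slices rather than a real kthread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 16                 /* toy size standing in for all of RAM */

/* Hypothetical stand-in for struct page; the field is illustrative only. */
struct toy_page {
    bool initialized;
};

static struct toy_page pages[NR_PAGES];
static size_t free_list[NR_PAGES];  /* indices of pages the allocator may hand out */
static size_t nr_free;              /* number of valid entries in free_list */
static size_t next_uninit;          /* first page that is not yet initialized */

/* Initialize one page and only then publish it on the free list. */
static void init_and_free_one(void)
{
    assert(next_uninit < NR_PAGES);
    pages[next_uninit].initialized = true;
    free_list[nr_free++] = next_uninit;
    next_uninit++;
}

/* Early boot: synchronously bring up only the first nr_sync pages. */
static void boot_init(size_t nr_sync)
{
    nr_free = 0;
    next_uninit = 0;
    for (size_t i = 0; i < NR_PAGES; i++)
        pages[i].initialized = false;
    while (next_uninit < nr_sync && next_uninit < NR_PAGES)
        init_and_free_one();
}

/* One slice of deferred work; returns how many pages are still pending. */
static size_t deferred_init_step(size_t batch)
{
    for (size_t i = 0; i < batch && next_uninit < NR_PAGES; i++)
        init_and_free_one();
    return NR_PAGES - next_uninit;
}

/* Allocation path: no "is this page initialized?" check is needed here,
 * because uninitialized pages are never on the free list. */
static struct toy_page *toy_alloc_page(void)
{
    if (nr_free == 0)
        return NULL;
    return &pages[free_list[--nr_free]];
}
```

A caller would run `boot_init(4)` once, allocate freely, and call `deferred_init_step()` from whatever deferred context is available until it returns 0. The cost is that the amount of free memory grows for a while after boot, which is the non-determinism raised as a drawback elsewhere in the thread.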
The code as presented clearly looks more involved and neither simple nor
zero-cost - I was hoping for a much simpler approach.

I see three solutions:

 - Speed up the synchronous memory init code: live migrate to the node
   being set up via set_cpus_allowed(), to make sure the init is always
   fast and local.

   Pros: if it solves the problem then mem init is still synchronous,
   deterministic and essentially equivalent to what we do today - so
   relatively simple and well-tested, with no 'large machine' special
   path.

   Cons: it might not be enough and we might not have scheduling
   enabled on the affected nodes yet.

 - Speed up the synchronous memory init code by parallelizing the key,
   most expensive initialization portion of setting up the page head
   arrays to per node, via SMP function-calls.

   Pros: by far the fastest synchronous option. (It will also test the
   power budget and the mains fuses right during bootup.)

   Cons: more complex and depends on SMP cross-calls being available at
   mem init time. Not necessarily hotplug friendly.

 - Avoid the problem by punting to async mem init.

   Pros: it gets us to a minimal working system quickly and leaves the
   memory code relatively untouched.

   Cons: makes memory state asynchronous and non-deterministic.
   Stats either fluctuate shortly after bootup or have to be faked.

Thanks,

        Ingo

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx180.postini.com [74.125.245.180]) by kanga.kvack.org (Postfix) with SMTP id DC7E06B0032 for ; Wed, 14 Aug 2013 18:15:08 -0400 (EDT) Date: Wed, 14 Aug 2013 17:15:06 -0500 From: Nathan Zimmer Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130814221505.GA147490@asylum.americas.sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <20130814110556.GH10849@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130814110556.GH10849@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: Linus Torvalds , Mike Travis , Nathan Zimmer , Peter Anvin , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman On Wed, Aug 14, 2013 at 01:05:56PM +0200, Ingo Molnar wrote: > > * Linus Torvalds wrote: > > > [...] > > > > Ok, so I don't know all the issues, and in many ways I don't even really > > care. You could do it other ways, I don't think this is a big deal. The > > part I hate is the runtime hook into the core MM page allocation code, > > so I'm just throwing out any random thing that comes to my mind that > > could be used to avoid that part. 
> > So, my hope was that it's possible to have a single, simple, zero-cost > runtime check [zero cost for already initialized pages], because it can be > merged into already existing page flag mask checks present here and > executed for every freshly allocated page: > > static inline int check_new_page(struct page *page) > { > if (unlikely(page_mapcount(page) | > (page->mapping != NULL) | > (atomic_read(&page->_count) != 0) | > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > (mem_cgroup_bad_page_check(page)))) { > bad_page(page); > return 1; > } > return 0; > } > > We already run this for every new page allocated and the initialization > check could hide in PAGE_FLAGS_CHECK_AT_PREP in a zero-cost fashion. > > I'd not do any of the ensure_page_is_initialized() or > __expand_page_initialization() complications in this patch-set - each page > head represents itself and gets iterated when check_new_page() is done. > > During regular bootup we'd initialize like before, except we don't set up > the page heads but memset() them to zero. With each page head 32 bytes > this would mean 8 GB of page head memory to clear per 1 TB - with 16 TB > that's 128 GB to clear - that ought to be possible to do rather quickly, > perhaps with some smart SMP cross-call approach that makes sure that each > memset is done in a node-local fashion. [*] > > Such an approach should IMO be far smaller and less invasive than the > patches presented so far: it should be below 100 lines or so. > > I don't know why there's such a big difference between the theory I > outlined and the invasive patch-set implemented so far in practice, > perhaps I'm missing some complication. I was trying to probe that > difference, before giving up on the idea and punting back to the async > hotplug-ish approach which would obviously work well too. 
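The sizing in the quoted paragraph (32 bytes of page head per page, hence 8 GB to clear per 1 TB and 128 GB for 16 TB) can be checked with a few lines of C. This is a sketch under assumptions the message does not state: 4 KiB base pages and binary units. The 32-byte figure is taken from the message as given, and `page_struct_bytes()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define KIB (UINT64_C(1) << 10)
#define GIB (UINT64_C(1) << 30)
#define TIB (UINT64_C(1) << 40)

/* Bytes of per-page metadata needed to describe mem_bytes of RAM,
 * given the base page size and the size of one page descriptor. */
static uint64_t page_struct_bytes(uint64_t mem_bytes, uint64_t page_size,
                                  uint64_t struct_size)
{
    assert(page_size != 0);
    return (mem_bytes / page_size) * struct_size;
}
```

With these assumptions, 1 TiB of RAM is 2^28 pages, and 2^28 pages times 32 bytes is 2^33 bytes, which matches the 8 GB per 1 TB quoted above. A different descriptor size scales the result linearly.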
> The reason, which I failed to mention, is that once we pull a page off the lru in either __rmqueue_fallback or __rmqueue_smallest, the first thing we do with it is expand() or sometimes move_freepages(). These then trip over some BUG_ON and VM_BUG_ON checks. Those BUG_ONs are what keep causing me to delve into the ensure/expand foolishness. Nate From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx184.postini.com [74.125.245.184]) by kanga.kvack.org (Postfix) with SMTP id 3EE956B0032 for ; Fri, 16 Aug 2013 12:36:46 -0400 (EDT) Message-ID: <520E5517.9070606@intel.com> Date: Fri, 16 Aug 2013 09:36:39 -0700 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Nathan Zimmer Cc: hpa@zytor.com, mingo@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Hey Nathan, Could you post your boot timing patches? My machines are much smaller than yours, but I'm curious how things behave here as well. I did some very imprecise timings (strace -t on a telnet attached to the serial console). The 'struct page' initializations take about a minute of boot time for me to do 1TB across 8 NUMA nodes (this is a glueless QPI system[1]). My _quick_ calculations suggest it's 2x as fast to initialize node0's memory vs.
the other nodes, and boot time increases by about a second for every 30G of memory we add. So even with nothing else fancy, we could get some serious improvements from just doing the initialization locally. [1] We call anything that uses pure QPI for the NUMA interconnects, without any other circuitry, "glueless". From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932592Ab3GLCEI (ORCPT ); Thu, 11 Jul 2013 22:04:08 -0400 Received: from relay2.sgi.com ([192.48.179.30]:42361 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S932556Ab3GLCEH (ORCPT ); Thu, 11 Jul 2013 22:04:07 -0400 From: Robin Holt To: "H. Peter Anvin" , Ingo Molnar Cc: Robin Holt , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Subject: [RFC 1/4] memblock: Introduce a for_each_reserved_mem_region iterator. Date: Thu, 11 Jul 2013 21:03:52 -0500 Message-Id: <1373594635-131067-2-git-send-email-holt@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org As part of initializing struct pages in 2MiB chunks, we noticed that at the end of free_all_bootmem(), there was nothing which had forced the reserved/allocated 4KiB pages to be initialized. This helper function will be used for that expansion. Signed-off-by: Robin Holt Signed-off-by: Nate Zimmer To: "H.
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/memblock.h | 18 ++++++++++++++++++ mm/memblock.c | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index f388203..e99bbd1 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start, i != (u64)ULLONG_MAX; \ __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid)) +void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start, + phys_addr_t *out_end); + +/** + * for_each_reserved_mem_region - iterate over all reserved memblock areas + * @i: u64 used as loop variable + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL + * + * Walks over reserved areas of memblock. Available as soon as memblock + * is initialized. + */ +#define for_each_reserved_mem_region(i, p_start, p_end) \ + for (i = 0UL, \ + __next_reserved_mem_region(&i, p_start, p_end); \ + i != (u64)ULLONG_MAX; \ + __next_reserved_mem_region(&i, p_start, p_end)) + #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid); diff --git a/mm/memblock.c b/mm/memblock.c index c5fad93..0d7d6e7 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size) } /** + * __next_reserved_mem_region - next function for for_each_reserved_mem_region() + * @idx: pointer to u64 loop variable + * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL + * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL + * + * Iterate over all reserved memory regions.
+ */ +void __init_memblock __next_reserved_mem_region(u64 *idx, + phys_addr_t *out_start, + phys_addr_t *out_end) +{ + struct memblock_type *rsv = &memblock.reserved; + + if (*idx >= 0 && *idx < rsv->cnt) { + struct memblock_region *r = &rsv->regions[*idx]; + phys_addr_t base = r->base; + phys_addr_t size = r->size; + + if (out_start) + *out_start = base; + if (out_end) + *out_end = base + size - 1; + + *idx += 1; + return; + } + + /* signal end of iteration */ + *idx = ULLONG_MAX; +} + +/** * __next_free_mem_range - next function for for_each_free_mem_range() * @idx: pointer to u64 loop variable * @nid: nid: node selector, %MAX_NUMNODES for all nodes -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932679Ab3GLCES (ORCPT ); Thu, 11 Jul 2013 22:04:18 -0400 Received: from relay3.sgi.com ([192.48.152.1]:49218 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S932650Ab3GLCEP (ORCPT ); Thu, 11 Jul 2013 22:04:15 -0400 From: Robin Holt To: "H. Peter Anvin" , Ingo Molnar Cc: Robin Holt , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Subject: [RFC 4/4] Sparse initialization of struct page array. Date: Thu, 11 Jul 2013 21:03:55 -0500 Message-Id: <1373594635-131067-5-git-send-email-holt@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org During boot of large memory machines, a significant portion of boot is spent initializing the struct page array. The vast majority of those pages are not referenced during boot. Change this over to only initializing the pages when they are actually allocated. 
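The idea in the paragraph above, deferring most struct page setup until a page in the chunk is first used, can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: `fake_page`, `boot_init()` and `ensure_initialized()` are hypothetical names, the 512-page chunk assumes 4KiB base pages and 2MiB chunks, and none of this is the kernel code from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_PAGES 512u                 /* 2MiB / 4KiB (assumed sizes) */
#define NR_PAGES    (4u * CHUNK_PAGES)   /* small model: four chunks */

/* Hypothetical stand-in for struct page. */
struct fake_page {
	unsigned char initialized;   /* per-page metadata has been set up */
	unsigned char chunk_pending; /* on a chunk head: rest of chunk not set up yet */
};

static struct fake_page memmap[NR_PAGES];
static unsigned long init_calls;     /* number of per-page initializations run */

static void init_single_page(struct fake_page *p)
{
	p->initialized = 1;
	init_calls++;
}

/* Boot time: set up only the first page of each chunk and flag the rest as pending. */
static void boot_init(void)
{
	for (size_t pfn = 0; pfn < NR_PAGES; pfn += CHUNK_PAGES) {
		init_single_page(&memmap[pfn]);
		memmap[pfn].chunk_pending = 1;
	}
}

/* First use: set up the remainder of the chunk that contains pfn, exactly once. */
static void ensure_initialized(size_t pfn)
{
	size_t head;

	assert(pfn < NR_PAGES);
	head = pfn & ~(size_t)(CHUNK_PAGES - 1);

	if (!memmap[head].chunk_pending)
		return;

	memmap[head].chunk_pending = 0;
	for (size_t i = head + 1; i < head + CHUNK_PAGES; i++)
		init_single_page(&memmap[i]);
}
```

In this model, `boot_init()` touches only four of the 2048 pages. The first `ensure_initialized()` call on any page of a chunk fills in the remaining 511 entries, and later calls on the same chunk do nothing.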
Besides the advantage of boot speed, this allows us the chance to use normal performance monitoring tools to determine where the bulk of time is spent during page initialization. Signed-off-by: Robin Holt Signed-off-by: Nate Zimmer To: "H. Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/mm.h | 11 +++++ include/linux/page-flags.h | 5 +- mm/nobootmem.c | 5 ++ mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- 4 files changed, 132 insertions(+), 6 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e0c8528..3de08b5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) __free_page(page); } +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); + +static inline void __reserve_bootmem_page(struct page *page) +{ + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; + phys_addr_t end = start + PAGE_SIZE; + + __reserve_bootmem_region(start, end); +} + static inline void free_reserved_page(struct page *page) { + __reserve_bootmem_page(page); __free_reserved_page(page); adjust_managed_page_count(page, 1); } diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6d53675..79e8eb7 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -83,6 +83,7 @@ enum pageflags { PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ PG_arch_1, PG_reserved, + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ PG_private, /* If pagecache, has fs-private data */ PG_private_2, /* If pagecache, has fs aux data */ PG_writeback, /* Page is under writeback */ @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) __PAGEFLAG(SlobFree, slob_free) +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) + /* * Private page markings that may be used by the filesystem that owns the page * for its own purposes. @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) #define PAGE_FLAGS_CHECK_AT_FREE \ (1 << PG_lru | 1 << PG_locked | \ 1 << PG_private | 1 << PG_private_2 | \ - 1 << PG_writeback | 1 << PG_reserved | \ + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ __PG_COMPOUND_LOCK) diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 3b512ca..e3a386d 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) phys_addr_t start, end, size; u64 i; + for_each_reserved_mem_region(i, &start, &end) + __reserve_bootmem_region(start, end); + for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) count += __free_memory_core(start, end); @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) { struct pglist_data *pgdat; + memblock_dump_all(); + for_each_online_pgdat(pgdat) reset_node_lowmem_managed_pages(pgdat); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 635b131..fe51eb9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i #endif } +static void expand_page_initialization(struct page *basepage) +{ + unsigned long pfn = page_to_pfn(basepage); + unsigned long end_pfn = pfn + PTRS_PER_PMD; + unsigned long zone = page_zonenum(basepage); + int reserved = PageReserved(basepage); + int nid = 
page_to_nid(basepage); + + ClearPageUninitialized2Mib(basepage); + + for( pfn++; pfn < end_pfn; pfn++ ) + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); +} + +void ensure_pages_are_initialized(unsigned long start_pfn, + unsigned long end_pfn) +{ + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); + unsigned long aligned_end_pfn; + struct page *page; + + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); + aligned_end_pfn += PTRS_PER_PMD; + while (aligned_start_pfn < aligned_end_pfn) { + if (pfn_valid(aligned_start_pfn)) { + page = pfn_to_page(aligned_start_pfn); + + if(PageUninitialized2Mib(page)) + expand_page_initialization(page); + } + + aligned_start_pfn += PTRS_PER_PMD; + } +} + +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) +{ + unsigned long start_pfn = PFN_DOWN(start); + unsigned long end_pfn = PFN_UP(end); + + ensure_pages_are_initialized(start_pfn, end_pfn); +} + +static inline void ensure_page_is_initialized(struct page *page) +{ + __reserve_bootmem_page(page); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) if (PageAnon(page)) page->mapping = NULL; for (i = 0; i < (1 << order); i++) - bad += free_pages_check(page + i); + if (PageUninitialized2Mib(page + i)) + i += PTRS_PER_PMD - 1; + else + bad += free_pages_check(page + i); if (bad) return false; @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) unsigned int loop; prefetchw(page); - for (loop = 0; loop < nr_pages; loop++) { + for (loop = 0; loop < nr_pages; ) { struct page *p = &page[loop]; if (loop + 1 < nr_pages) prefetchw(p + 1); + + if ((PageUninitialized2Mib(p)) && + ((loop + PTRS_PER_PMD) > nr_pages)) + ensure_page_is_initialized(p); + __ClearPageReserved(p); set_page_count(p, 0); + if (PageUninitialized2Mib(p)) + loop += PTRS_PER_PMD; + else + loop += 1; } 
page_zone(page)->managed_pages += 1 << order; @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page, area--; high--; size >>= 1; + ensure_page_is_initialized(page); VM_BUG_ON(bad_range(zone, &page[size])); #ifdef CONFIG_DEBUG_PAGEALLOC @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) for (i = 0; i < (1 << order); i++) { struct page *p = page + i; + + if (PageUninitialized2Mib(p)) + expand_page_initialization(page); + if (unlikely(check_new_page(p))) return 1; } @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone, unsigned long order; int pages_moved = 0; + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page)); #ifndef CONFIG_HOLES_IN_ZONE /* * page_zone is not safe to call in this context when @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) for (pfn = start_pfn; pfn < end_pfn; pfn++) { if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) return 1; + + if (PageUninitialized2Mib(pfn_to_page(pfn))) + pfn += PTRS_PER_PMD; } return 0; } @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone) } } +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, + unsigned long size, int nid) +{ + unsigned long validate_end_pfn = pfn + size; + + if (pfn & (size - 1)) + return 1; + + if (pfn + size >= end_pfn) + return 1; + + while (pfn < validate_end_pfn) + { + if (!early_pfn_valid(pfn)) + return 1; + if (!early_pfn_in_nid(pfn, nid)) + return 1; + pfn++; + } + + return size; +} + /* * Initially all pages are reserved - free ones are freed * up by free_all_bootmem() once the early boot process is @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, highest_memmap_pfn = end_pfn - 1; z = &NODE_DATA(nid)->node_zones[zone]; - for (pfn = start_pfn; pfn < end_pfn; pfn++) { + for (pfn = start_pfn; pfn < end_pfn; ) { /* * There 
can be holes in boot-time mem_map[]s * handed to this function. They do not * exist on hotplugged memory. */ + int pfns = 1; if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn++; continue; - if (!early_pfn_in_nid(pfn, nid)) + } + if (!early_pfn_in_nid(pfn, nid)) { + pfn++; continue; + } + + pfns = pfn_range_init_avail(pfn, end_pfn, + PTRS_PER_PMD, nid); } + page = pfn_to_page(pfn); __init_single_page(page, zone, nid, 1); + + if (pfns > 1) + SetPageUninitialized2Mib(page); + + pfn += pfns; } } @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { {1UL << PG_owner_priv_1, "owner_priv_1" }, {1UL << PG_arch_1, "arch_1" }, {1UL << PG_reserved, "reserved" }, + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, {1UL << PG_private, "private" }, {1UL << PG_private_2, "private_2" }, {1UL << PG_writeback, "writeback" }, -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932644Ab3GLCEP (ORCPT ); Thu, 11 Jul 2013 22:04:15 -0400 Received: from relay2.sgi.com ([192.48.179.30]:42376 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S932556Ab3GLCEN (ORCPT ); Thu, 11 Jul 2013 22:04:13 -0400 From: Robin Holt To: "H. Peter Anvin" , Ingo Molnar Cc: Robin Holt , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Subject: [RFC 3/4] Seperate page initialization into a separate function. Date: Thu, 11 Jul 2013 21:03:54 -0500 Message-Id: <1373594635-131067-4-git-send-email-holt@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently, memmap_init_zone() has all the smarts for initializing a single page. 
When we convert to initializing pages in a 2MiB chunk, we will need to do this equivalent work from two separate places so we are breaking out a helper function. Signed-off-by: Robin Holt Signed-off-by: Nate Zimmer To: "H. Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- mm/mm_init.c | 2 +- mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------ 2 files changed, 45 insertions(+), 32 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index c280a02..be8a539 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, +void mminit_verify_page_links(struct page *page, enum zone_type zone, unsigned long nid, unsigned long pfn) { BUG_ON(page_to_nid(page) != nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c3edb62..635b131 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } +static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved) +{ + unsigned long pfn = page_to_pfn(page); + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_nid_reset_last(page); + if (reserved) { + SetPageReserved(page); + } else { + ClearPageReserved(page); + set_page_count(page, 0); + } + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. 
Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. + */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. */ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, continue; } page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_nid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_page(page, zone, nid, 1); } } -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932446Ab3GLCEM (ORCPT ); Thu, 11 Jul 2013 22:04:12 -0400 Received: from relay3.sgi.com ([192.48.152.1]:49204 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S932556Ab3GLCEK (ORCPT ); Thu, 11 Jul 2013 22:04:10 -0400 From: Robin Holt To: "H. Peter Anvin" , Ingo Molnar Cc: Robin Holt , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Subject: [RFC 2/4] Have __free_pages_memory() free in larger chunks. Date: Thu, 11 Jul 2013 21:03:53 -0500 Message-Id: <1373594635-131067-3-git-send-email-holt@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently, when free_all_bootmem() calls __free_pages_memory(), the number of contiguous pages that __free_pages_memory() passes to the buddy allocator is limited to BITS_PER_LONG. In order to be able to free only the first page of a 2MiB chunk, we need that to be increased to PTRS_PER_PMD. Signed-off-by: Robin Holt Signed-off-by: Nate Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- mm/nobootmem.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/nobootmem.c b/mm/nobootmem.c index bdd3fa2..3b512ca 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) static void __init __free_pages_memory(unsigned long start, unsigned long end) { unsigned long i, start_aligned, end_aligned; - int order = ilog2(BITS_PER_LONG); + int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD)); - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1); - end_aligned = end & ~(BITS_PER_LONG - 1); + start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1); + end_aligned = end & ~((1UL << order) - 1); if (end_aligned <= start_aligned) { for (i = start; i < end; i++) @@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end) for (i = start; i < start_aligned; i++) __free_pages_bootmem(pfn_to_page(i), 0); - for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG) + for (i = start_aligned; i < end_aligned; i += 1 << order) __free_pages_bootmem(pfn_to_page(i), order); for (i = end_aligned; i < end; i++) -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932555Ab3GLCEG (ORCPT ); Thu, 11 Jul 2013 22:04:06 -0400 Received: from relay3.sgi.com ([192.48.152.1]:49191 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S932414Ab3GLCEF (ORCPT ); Thu, 11 Jul 2013 22:04:05 -0400 From: Robin Holt To: "H. 
Peter Anvin" , Ingo Molnar Cc: Robin Holt , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Subject: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Date: Thu, 11 Jul 2013 21:03:51 -0500 Message-Id: <1373594635-131067-1-git-send-email-holt@sgi.com> X-Mailer: git-send-email 1.8.2.1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org We have been working on this since we returned from shutdown and have something to discuss now. We restricted ourselves to 2MiB initialization to keep the patch set a little smaller and more clear. First, I think I want to propose getting rid of the page flag. If I knew of a concrete way to determine that the page has not been initialized, this patch series would look different. If there is no definitive way to determine that the struct page has been initialized aside from checking the entire page struct is zero, then I think I would suggest we change the page flag to indicate the page has been initialized. The heart of the problem as I see it comes from expand(). We nearly always see a first reference to a struct page which is in the middle of the 2MiB region. Due to that access, the unlikely() check that was originally proposed really ends up referencing a different page entirely. We actually did not introduce an unlikely and refactor the patches to make that unlikely inside a static inline function. Also, given the strong warning at the head of expand(), we did not feel experienced enough to refactor it to make things always reference the 2MiB page first. With this patch, we did boot a 16TiB machine. Without the patches, the v3.10 kernel with the same configuration took 407 seconds for free_all_bootmem. With the patches and operating on 2MiB pages instead of 1GiB, it took 26 seconds so performance was improved. 
I have no feel for how the 1GiB chunk size will perform. I am on vacation for the next three days so I am sorry in advance for my infrequent or non-existent responses. Signed-off-by: Robin Holt Signed-off-by: Nate Zimmer To: "H. Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757191Ab3GLHqC (ORCPT ); Fri, 12 Jul 2013 03:46:02 -0400 Received: from relay2.sgi.com ([192.48.179.30]:39051 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1756798Ab3GLHqB (ORCPT ); Fri, 12 Jul 2013 03:46:01 -0400 Date: Fri, 12 Jul 2013 02:45:58 -0500 From: Robin Holt To: "H. Peter Anvin" , Ingo Molnar Cc: Robin Holt , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks. Message-ID: <20130712074558.GP18798@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-3-git-send-email-holt@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1373594635-131067-3-git-send-email-holt@sgi.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org After sleeping on this, why can't we change __free_pages_bootmem to take nr_pages rather than an order? If we did that, then __free_pages_memory could just calculate nr_pages and call __free_pages_bootmem one time. I don't see why any of the callers of __free_pages_bootmem would not easily support that change. Maybe I will work that up as part of a -v2 and see if it boots/runs.
At the very least, I think we could change to: static void __init __free_pages_memory(unsigned long start, unsigned long end) { int order; while (start < end) { /* __ffs() gives the 0-based alignment of start (ffs() is 1-based); cap at the largest buddy order */ order = min(MAX_ORDER - 1UL, __ffs(start)); while (start + (1UL << order) > end) order--; /* __free_pages_bootmem() takes a struct page *, not a pfn */ __free_pages_bootmem(pfn_to_page(start), order); start += (1UL << order); } } Robin On Thu, Jul 11, 2013 at 09:03:53PM -0500, Robin Holt wrote: > Currently, when free_all_bootmem() calls __free_pages_memory(), the > number of contiguous pages that __free_pages_memory() passes to the > buddy allocator is limited to BITS_PER_LONG. In order to be able to > free only the first page of a 2MiB chunk, we need that to be increased > to PTRS_PER_PMD. > > Signed-off-by: Robin Holt > Signed-off-by: Nate Zimmer > To: "H. Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > --- > mm/nobootmem.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index bdd3fa2..3b512ca 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) > static void __init __free_pages_memory(unsigned long start, unsigned long end) > { > unsigned long i, start_aligned, end_aligned; > - int order = ilog2(BITS_PER_LONG); > + int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD)); > > - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1); > - end_aligned = end & ~(BITS_PER_LONG - 1); > + start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1); > + end_aligned = end & ~((1UL << order) - 1); > > if (end_aligned <= start_aligned) { > for (i = start; i < end; i++) > @@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end) > for (i = start; i < start_aligned; i++) > __free_pages_bootmem(pfn_to_page(i), 0); > > - for (i = start_aligned; i <
end_aligned; i += BITS_PER_LONG) > + for (i = start_aligned; i < end_aligned; i += 1 << order) > __free_pages_bootmem(pfn_to_page(i), order); > > for (i = end_aligned; i < end; i++) > -- > 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932362Ab3GLI2H (ORCPT ); Fri, 12 Jul 2013 04:28:07 -0400 Received: from mail-ea0-f175.google.com ([209.85.215.175]:52564 "EHLO mail-ea0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757284Ab3GLI2B (ORCPT ); Fri, 12 Jul 2013 04:28:01 -0400 Date: Fri, 12 Jul 2013 10:27:56 +0200 From: Ingo Molnar To: Robin Holt , Borislav Petkov , Robert Richter Cc: "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130712082756.GA4328@gmail.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Robin Holt wrote: > [...] > > With this patch, we did boot a 16TiB machine. Without the patches, the > v3.10 kernel with the same configuration took 407 seconds for > free_all_bootmem. With the patches and operating on 2MiB pages instead > of 1GiB, it took 26 seconds so performance was improved. I have no feel > for how the 1GiB chunk size will perform. That's pretty impressive. It's still a 15x speedup instead of a 512x speedup, so I'd say there's something else being the current bottleneck, besides page init granularity. 
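The 15x and 512x figures above follow from the numbers in the cover letter, and a short C sketch makes the gap concrete. The helper names are hypothetical, and `non_scaling_seconds()` rests on a deliberately simple assumption that is not from the thread: boot time splits into a part that shrinks in proportion to the number of struct pages initialized and a part that does not shrink at all.

```c
#include <assert.h>

/* How many fewer struct pages are touched at boot: one per chunk instead of one per page. */
static unsigned long granularity_ratio(unsigned long chunk_bytes,
				       unsigned long page_bytes)
{
	assert(page_bytes != 0 && chunk_bytes % page_bytes == 0);
	return chunk_bytes / page_bytes;
}

/* Observed speedup from two wall-clock measurements, in seconds. */
static double observed_speedup(double before_s, double after_s)
{
	assert(after_s > 0.0);
	return before_s / after_s;
}

/*
 * Assumed model: before = fixed + scaling, after = fixed + scaling / ratio.
 * Solving for the fixed part gives (after * ratio - before) / (ratio - 1).
 */
static double non_scaling_seconds(double before_s, double after_s, double ratio)
{
	assert(ratio > 1.0);
	return (after_s * ratio - before_s) / (ratio - 1.0);
}
```

With the reported 407 s and 26 s, `observed_speedup()` is roughly 15.7 while `granularity_ratio(2UL << 20, 4UL << 10)` is 512. Under the assumed model, about 25 of the remaining 26 seconds would not scale with page-init granularity, which is consistent with the suggestion that something else is now the bottleneck.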
Can you boot with just a few gigs of RAM and stuff the rest into hotplug memory, and then hot-add that memory? That would allow easy profiling of remaining overhead. Side note: Robert Richter and Boris Petkov are working on 'persistent events' support for perf, which will eventually allow boot time profiling - I'm not sure if the patches and the tooling support is ready enough yet for your purposes. Robert, Boris, the following workflow would be pretty intuitive: - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB - we'd get a single (cycles?) event running once the perf subsystem is up and running, with a sampling frequency of 1 KHz, sending profiling trace events to a sufficiently sized profiling buffer of 16 MB per CPU. - once the system reaches SYSTEM_RUNNING, profiling is stopped either automatically - or the user stops it via a new tooling command. - the profiling buffer is extracted into a regular perf.data via a special 'perf record' call or some other, new perf tooling solution/variant. [ Alternatively the kernel could attempt to construct a 'virtual' perf.data from the persistent buffer, available via /sys/debug or elsewhere in /sys - just like the kernel constructs a 'virtual' /proc/kcore, etc. That file could be copied or used directly. ] - from that point on this workflow joins the regular profiling workflow: perf report, perf script et al can be used to analyze the resulting boot profile. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757389Ab3GLIrg (ORCPT ); Fri, 12 Jul 2013 04:47:36 -0400 Received: from mail.skyhub.de ([78.46.96.112]:36502 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757296Ab3GLIrc (ORCPT ); Fri, 12 Jul 2013 04:47:32 -0400 Date: Fri, 12 Jul 2013 10:47:12 +0200 From: Borislav Petkov To: Ingo Molnar Cc: Robin Holt , Robert Richter , "H. 
Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra Subject: boot tracing Message-ID: <20130712084712.GD24008@pd.tnic> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20130712082756.GA4328@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > Robert Richter and Boris Petkov are working on 'persistent events' > support for perf, which will eventually allow boot time profiling - > I'm not sure if the patches and the tooling support is ready enough > yet for your purposes. Nope, not yet but we're getting there. > Robert, Boris, the following workflow would be pretty intuitive: > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB What does perf=boot mean? I assume boot tracing. If so, does it mean we want to enable *all* tracepoints and collect whatever hits us? What makes more sense to me is to hijack what the function tracer does - i.e. simply collect all function calls. > - we'd get a single (cycles?) event running once the perf subsystem is up > and running, with a sampling frequency of 1 KHz, sending profiling > trace events to a sufficiently sized profiling buffer of 16 MB per > CPU. Right, what would those trace events be? > - once the system reaches SYSTEM_RUNNING, profiling is stopped either > automatically - or the user stops it via a new tooling command. Ok. > - the profiling buffer is extracted into a regular perf.data via a > special 'perf record' call or some other, new perf tooling > solution/variant. 
> > [ Alternatively the kernel could attempt to construct a 'virtual' > perf.data from the persistent buffer, available via /sys/debug or > elsewhere in /sys - just like the kernel constructs a 'virtual' > /proc/kcore, etc. That file could be copied or used directly. ] Yeah, that. > - from that point on this workflow joins the regular profiling workflow: > perf report, perf script et al can be used to analyze the resulting > boot profile. Agreed. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757825Ab3GLIxs (ORCPT ); Fri, 12 Jul 2013 04:53:48 -0400 Received: from mail-ee0-f47.google.com ([74.125.83.47]:62402 "EHLO mail-ee0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757809Ab3GLIxp (ORCPT ); Fri, 12 Jul 2013 04:53:45 -0400 Date: Fri, 12 Jul 2013 10:53:41 +0200 From: Ingo Molnar To: Borislav Petkov Cc: Robin Holt , Robert Richter , "H. 
Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra Subject: Re: boot tracing Message-ID: <20130712085341.GC4328@gmail.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> <20130712084712.GD24008@pd.tnic> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130712084712.GD24008@pd.tnic> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Borislav Petkov wrote: > On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > > Robert Richter and Boris Petkov are working on 'persistent events' > > support for perf, which will eventually allow boot time profiling - > > I'm not sure if the patches and the tooling support is ready enough > > yet for your purposes. > > Nope, not yet but we're getting there. > > > Robert, Boris, the following workflow would be pretty intuitive: > > > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > > What does perf=boot mean? I assume boot tracing. In this case it would mean boot profiling - i.e. a cycles hardware-PMU event collecting into a perf trace buffer as usual. Essentially a 'perf record -a' work-alike, just one that gets activated as early as practical, and which would allow the profiling of memory initialization. Now, one extra complication here is that to be able to profile buddy allocator this persistent event would have to work before the buddy allocator is active :-/ So this sort of profiling would have to use memblock_alloc(). Just wanted to highlight this usecase, we might eventually want to support it. [ Note that this is different from boot tracing of one or more trace events - but it's a conceptually pretty close cousin. 
] Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932594Ab3GLJTU (ORCPT ); Fri, 12 Jul 2013 05:19:20 -0400 Received: from mail-wg0-f44.google.com ([74.125.82.44]:61941 "EHLO mail-wg0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932251Ab3GLJTS (ORCPT ); Fri, 12 Jul 2013 05:19:18 -0400 Date: Fri, 12 Jul 2013 10:19:09 +0100 From: Robert Richter To: Ingo Molnar Cc: Robin Holt , Borislav Petkov , "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130712091909.GC8731@rric.localhost> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130712082756.GA4328@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 12.07.13 10:27:56, Ingo Molnar wrote: > > * Robin Holt wrote: > > > [...] > > > > With this patch, we did boot a 16TiB machine. Without the patches, the > > v3.10 kernel with the same configuration took 407 seconds for > > free_all_bootmem. With the patches and operating on 2MiB pages instead > > of 1GiB, it took 26 seconds so performance was improved. I have no feel > > for how the 1GiB chunk size will perform. > > That's pretty impressive. > > It's still a 15x speedup instead of a 512x speedup, so I'd say there's > something else being the current bottleneck, besides page init > granularity. > > Can you boot with just a few gigs of RAM and stuff the rest into hotplug > memory, and then hot-add that memory? That would allow easy profiling of > remaining overhead. 
> > Side note: > > Robert Richter and Boris Petkov are working on 'persistent events' support > for perf, which will eventually allow boot time profiling - I'm not sure > if the patches and the tooling support is ready enough yet for your > purposes. The latest patch set is still this: git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v2 It requires the perf subsystem to be initialized first which might be too late, see perf_event_init() in start_kernel(). The patch set is currently also limited to tracepoints only. If this is sufficient for you, you might register persistent events with the function perf_add_persistent_event_by_id(), see mcheck_init_tp() how to do this. Later you can fetch all samples with: # perf record -e persistent// sleep 1 > Robert, Boris, the following workflow would be pretty intuitive: > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > > - we'd get a single (cycles?) event running once the perf subsystem is up > and running, with a sampling frequency of 1 KHz, sending profiling > trace events to a sufficiently sized profiling buffer of 16 MB per > CPU. I am not sure about the event you want to setup here, if it is a tracepoint the sample_period should be always 1. The buffer size parameter looks interesting, for now it is 512kB per cpu per default (as perf tools setup the buffer). > > - once the system reaches SYSTEM_RUNNING, profiling is stopped either > automatically - or the user stops it via a new tooling command. > > - the profiling buffer is extracted into a regular perf.data via a > special 'perf record' call or some other, new perf tooling > solution/variant. See the perf-record command above... > > [ Alternatively the kernel could attempt to construct a 'virtual' > perf.data from the persistent buffer, available via /sys/debug or > elsewhere in /sys - just like the kernel constructs a 'virtual' > /proc/kcore, etc. That file could be copied or used directly. 
] > > - from that point on this workflow joins the regular profiling workflow: > perf report, perf script et al can be used to analyze the resulting > boot profile. Ingo, thanks for outlining this workflow. We will look at how this could fit into the new version of persistent events we are currently working on. Thanks, -Robert From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758100Ab3GMDGz (ORCPT ); Fri, 12 Jul 2013 23:06:55 -0400 Received: from mail-ie0-f175.google.com ([209.85.223.175]:59724 "EHLO mail-ie0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758078Ab3GMDGy (ORCPT ); Fri, 12 Jul 2013 23:06:54 -0400 MIME-Version: 1.0 In-Reply-To: <1373594635-131067-4-git-send-email-holt@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-4-git-send-email-holt@sgi.com> Date: Fri, 12 Jul 2013 20:06:52 -0700 X-Google-Sender-Auth: WO3CJRJl2LYGD8i5-oXb9TJwlNQ Message-ID: Subject: Re: [RFC 3/4] Seperate page initialization into a separate function. From: Yinghai Lu To: Robin Holt Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote: > Currently, memmap_init_zone() has all the smarts for initializing a > single page. When we convert to initializing pages in a 2MiB chunk, > we will need to do this equivalent work from two separate places > so we are breaking out a helper function. > > Signed-off-by: Robin Holt > Signed-off-by: Nate Zimmer > To: "H. 
Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > --- > mm/mm_init.c | 2 +- > mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------ > 2 files changed, 45 insertions(+), 32 deletions(-) > > diff --git a/mm/mm_init.c b/mm/mm_init.c > index c280a02..be8a539 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) > BUG_ON(or_mask != add_mask); > } > > -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, > +void mminit_verify_page_links(struct page *page, enum zone_type zone, > unsigned long nid, unsigned long pfn) > { > BUG_ON(page_to_nid(page) != nid); > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c3edb62..635b131 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > spin_unlock(&zone->lock); > } > > +static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved) > +{ > + unsigned long pfn = page_to_pfn(page); > + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > + > + set_page_links(page, zone, nid, pfn); > + mminit_verify_page_links(page, zone, nid, pfn); > + init_page_count(page); > + page_mapcount_reset(page); > + page_nid_reset_last(page); > + if (reserved) { > + SetPageReserved(page); > + } else { > + ClearPageReserved(page); > + set_page_count(page, 0); > + } > + /* > + * Mark the block movable so that blocks are reserved for > + * movable at startup. This will force kernel allocations > + * to reserve their blocks rather than leaking throughout > + * the address space during boot when many long-lived > + * kernel allocations are made. 
Later some blocks near > + * the start are marked MIGRATE_RESERVE by > + * setup_zone_migrate_reserve() > + * > + * bitmap is created for zone's valid pfn range. but memmap > + * can be created for invalid pages (for alignment) > + * check here not to call set_pageblock_migratetype() against > + * pfn out of zone. > + */ > + if ((z->zone_start_pfn <= pfn) > + && (pfn < zone_end_pfn(z)) > + && !(pfn & (pageblock_nr_pages - 1))) > + set_pageblock_migratetype(page, MIGRATE_MOVABLE); > + > + INIT_LIST_HEAD(&page->lru); > +#ifdef WANT_PAGE_VIRTUAL > + /* The shift won't overflow because ZONE_NORMAL is below 4G. */ > + if (!is_highmem_idx(zone)) > + set_page_address(page, __va(pfn << PAGE_SHIFT)); > +#endif > +} > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > @@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > continue; > } > page = pfn_to_page(pfn); > - set_page_links(page, zone, nid, pfn); > - mminit_verify_page_links(page, zone, nid, pfn); > - init_page_count(page); > - page_mapcount_reset(page); > - page_nid_reset_last(page); > - SetPageReserved(page); > - /* > - * Mark the block movable so that blocks are reserved for > - * movable at startup. This will force kernel allocations > - * to reserve their blocks rather than leaking throughout > - * the address space during boot when many long-lived > - * kernel allocations are made. Later some blocks near > - * the start are marked MIGRATE_RESERVE by > - * setup_zone_migrate_reserve() > - * > - * bitmap is created for zone's valid pfn range. but memmap > - * can be created for invalid pages (for alignment) > - * check here not to call set_pageblock_migratetype() against > - * pfn out of zone. 
> - */ > - if ((z->zone_start_pfn <= pfn) > - && (pfn < zone_end_pfn(z)) > - && !(pfn & (pageblock_nr_pages - 1))) > - set_pageblock_migratetype(page, MIGRATE_MOVABLE); > - > - INIT_LIST_HEAD(&page->lru); > -#ifdef WANT_PAGE_VIRTUAL > - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ > - if (!is_highmem_idx(zone)) > - set_page_address(page, __va(pfn << PAGE_SHIFT)); > -#endif > + __init_single_page(page, zone, nid, 1); Can you move page = pfn_to_page(pfn) into __init_single_page and pass pfn directly? Yinghai From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758104Ab3GMDI5 (ORCPT ); Fri, 12 Jul 2013 23:08:57 -0400 Received: from mail-ie0-f179.google.com ([209.85.223.179]:59030 "EHLO mail-ie0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758078Ab3GMDI4 (ORCPT ); Fri, 12 Jul 2013 23:08:56 -0400 MIME-Version: 1.0 In-Reply-To: <20130712074558.GP18798@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-3-git-send-email-holt@sgi.com> <20130712074558.GP18798@sgi.com> Date: Fri, 12 Jul 2013 20:08:56 -0700 X-Google-Sender-Auth: PxdDMSeY6XHJZt9t1MfZIWiVD4A Message-ID: Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks. From: Yinghai Lu To: Robin Holt Cc: "H. 
Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 12, 2013 at 12:45 AM, Robin Holt wrote: > At the very least, I think we could change to: > static void __init __free_pages_memory(unsigned long start, unsigned long end) > { > int order; > > while (start < end) { > order = ffs(start); > > while (start + (1UL << order) > end) > order--; > > __free_pages_bootmem(start, order); > > start += (1UL << order); > } > } should work, but need to make sure order < MAX_ORDER. Yinghai From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751152Ab3GMETP (ORCPT ); Sat, 13 Jul 2013 00:19:15 -0400 Received: from mail-ie0-f170.google.com ([209.85.223.170]:64446 "EHLO mail-ie0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750858Ab3GMETN (ORCPT ); Sat, 13 Jul 2013 00:19:13 -0400 MIME-Version: 1.0 In-Reply-To: <1373594635-131067-5-git-send-email-holt@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> Date: Fri, 12 Jul 2013 21:19:12 -0700 X-Google-Sender-Auth: c3O0N7wTphRrehIWUMq7keEois0 Message-ID: Subject: Re: [RFC 4/4] Sparse initialization of struct page array. From: Yinghai Lu To: Robin Holt Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote: > During boot of large memory machines, a significant portion of boot > is spent initializing the struct page array. 
The vast majority of > those pages are not referenced during boot. > > Change this over to only initializing the pages when they are > actually allocated. > > Besides the advantage of boot speed, this allows us the chance to > use normal performance monitoring tools to determine where the bulk > of time is spent during page initialization. > > Signed-off-by: Robin Holt > Signed-off-by: Nate Zimmer > To: "H. Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > --- > include/linux/mm.h | 11 +++++ > include/linux/page-flags.h | 5 +- > mm/nobootmem.c | 5 ++ > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- > 4 files changed, 132 insertions(+), 6 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e0c8528..3de08b5 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > __free_page(page); > } > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > + > +static inline void __reserve_bootmem_page(struct page *page) > +{ > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > + phys_addr_t end = start + PAGE_SIZE; > + > + __reserve_bootmem_region(start, end); > +} > + > static inline void free_reserved_page(struct page *page) > { > + __reserve_bootmem_page(page); > __free_reserved_page(page); > adjust_managed_page_count(page, 1); > } > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 6d53675..79e8eb7 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -83,6 +83,7 @@ enum pageflags { > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > PG_arch_1, > PG_reserved, > + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ > PG_private, /* If pagecache, has fs-private data */ > PG_private_2, /* If pagecache, has fs aux data */ > PG_writeback, /* Page is under writeback */ > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) > > __PAGEFLAG(SlobFree, slob_free) > > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) > + > /* > * Private page markings that may be used by the filesystem that owns the page > * for its own purposes. > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) > #define PAGE_FLAGS_CHECK_AT_FREE \ > (1 << PG_lru | 1 << PG_locked | \ > 1 << PG_private | 1 << PG_private_2 | \ > - 1 << PG_writeback | 1 << PG_reserved | \ > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ > __PG_COMPOUND_LOCK) > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index 3b512ca..e3a386d 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) > phys_addr_t start, end, size; > u64 i; > > + for_each_reserved_mem_region(i, &start, &end) > + __reserve_bootmem_region(start, end); > + What about holes that are not in memblock.reserved? Before this patch, free_area_init_node/free_area_init_core/memmap_init_zone would first mark every page in the node range as Reserved in its struct page. But those holes are not mapped via the kernel low mapping, so it should be OK not to touch "struct page" for them. Now you only mark memblock.reserved as reserved at first, and later mark {memblock.memory} - {memblock.reserved} as available. That is OK, but the change should be split into another patch, with a comment and a changelog entry for it, in case some user like UEFI does something weird.
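The set being described, {memblock.memory} - {memblock.reserved}, can be sketched in user space as a subtraction of sorted, half-open ranges. This is not the kernel's memblock code: struct range and subtract_ranges() are hypothetical names, and both inputs are assumed to be sorted by start and non-overlapping. A hole that is in neither list never shows up in the output, which is the case the comment above is asking about.

```c
#include <stddef.h>

/* Half-open range [start, end). Hypothetical type for this sketch only. */
struct range {
	unsigned long start;
	unsigned long end;
};

/*
 * Write mem[] minus rsv[] into out[] and return the number of ranges
 * written (at most max_out). Both inputs are assumed to be sorted by
 * start address and free of internal overlap.
 */
size_t subtract_ranges(const struct range *mem, size_t nmem,
		       const struct range *rsv, size_t nrsv,
		       struct range *out, size_t max_out)
{
	size_t n = 0, j = 0;

	for (size_t i = 0; i < nmem; i++) {
		unsigned long cur = mem[i].start;

		/* Skip reserved ranges that end at or below this memory range. */
		while (j < nrsv && rsv[j].end <= cur)
			j++;

		/* Walk every reserved range that starts inside this memory range. */
		for (size_t k = j; k < nrsv && rsv[k].start < mem[i].end; k++) {
			if (rsv[k].start > cur) {
				if (n == max_out)
					return n;
				out[n++] = (struct range){ cur, rsv[k].start };
			}
			if (rsv[k].end > cur)
				cur = rsv[k].end;
		}

		/* Whatever is left after the last reserved range is free. */
		if (cur < mem[i].end) {
			if (n == max_out)
				return n;
			out[n++] = (struct range){ cur, mem[i].end };
		}
	}
	return n;
}
```

For example, with memory [0,100) and [200,300) and reserved [10,20), [90,210) and [250,260), the result is [0,10), [20,90), [210,250) and [260,300). The gap [100,200) is in neither input and is simply never visited.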
> for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > count += __free_memory_core(start, end); > > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) > { > struct pglist_data *pgdat; > > + memblock_dump_all(); > + Not needed. > for_each_online_pgdat(pgdat) > reset_node_lowmem_managed_pages(pgdat); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 635b131..fe51eb9 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > #endif > } > > +static void expand_page_initialization(struct page *basepage) > +{ > + unsigned long pfn = page_to_pfn(basepage); > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > + unsigned long zone = page_zonenum(basepage); > + int reserved = PageReserved(basepage); > + int nid = page_to_nid(basepage); > + > + ClearPageUninitialized2Mib(basepage); > + > + for( pfn++; pfn < end_pfn; pfn++ ) > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > +} > + > +void ensure_pages_are_initialized(unsigned long start_pfn, > + unsigned long end_pfn) > +{ > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > + unsigned long aligned_end_pfn; > + struct page *page; > + > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > + aligned_end_pfn += PTRS_PER_PMD; > + while (aligned_start_pfn < aligned_end_pfn) { > + if (pfn_valid(aligned_start_pfn)) { > + page = pfn_to_page(aligned_start_pfn); > + > + if(PageUninitialized2Mib(page)) > + expand_page_initialization(page); > + } > + > + aligned_start_pfn += PTRS_PER_PMD; > + } > +} > + > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > +{ > + unsigned long start_pfn = PFN_DOWN(start); > + unsigned long end_pfn = PFN_UP(end); > + > + ensure_pages_are_initialized(start_pfn, end_pfn); > +} that name is confusing, actually it is setting to struct page to Reserved only. maybe __reserve_pages_bootmem() to be aligned to free_pages_bootmem ? 
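The chunk alignment in ensure_pages_are_initialized() quoted above can be tried out in isolation. The sketch below assumes PTRS_PER_PMD is 512 (x86_64 with 4KiB pages) and uses made-up function names; it only counts how many 2MiB chunk heads the loop would visit. One thing it shows: rounding end_pfn down and then adding a full chunk treats end_pfn as inclusive, so an already aligned exclusive end (as PFN_UP() produces in __reserve_bootmem_region()) visits one extra chunk. Rounding up instead is shown for comparison, as a possible alternative rather than a tested fix.

```c
/* Assumed value of PTRS_PER_PMD on x86_64 with 4KiB pages. */
#define CHUNK_PFNS 512UL

/*
 * Number of chunk heads visited with the alignment as posted:
 * start rounded down, end rounded down and then bumped by one chunk.
 */
unsigned long chunks_visited_as_posted(unsigned long start_pfn,
				       unsigned long end_pfn)
{
	unsigned long s = start_pfn & ~(CHUNK_PFNS - 1);
	unsigned long e = (end_pfn & ~(CHUNK_PFNS - 1)) + CHUNK_PFNS;

	return (e - s) / CHUNK_PFNS;
}

/*
 * Hypothetical variant that treats end_pfn as exclusive: start rounded
 * down, end rounded up to the next chunk boundary.
 */
unsigned long chunks_visited_exclusive_end(unsigned long start_pfn,
					   unsigned long end_pfn)
{
	unsigned long s = start_pfn & ~(CHUNK_PFNS - 1);
	unsigned long e = (end_pfn + CHUNK_PFNS - 1) & ~(CHUNK_PFNS - 1);

	return (e - s) / CHUNK_PFNS;
}
```

For the pfn range [0, 512), the posted alignment visits two chunk heads (pfn 0 and pfn 512) while the round-up variant visits one. For an unaligned range such as [100, 600) both visit two. Visiting an extra chunk only costs an early initialization, so this looks like a precision issue more than a correctness one.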
> + > +static inline void ensure_page_is_initialized(struct page *page) > +{ > + __reserve_bootmem_page(page); > +} how about use __reserve_page_bootmem directly and add comment in callers site ? > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) > if (PageAnon(page)) > page->mapping = NULL; > for (i = 0; i < (1 << order); i++) > - bad += free_pages_check(page + i); > + if (PageUninitialized2Mib(page + i)) > + i += PTRS_PER_PMD - 1; > + else > + bad += free_pages_check(page + i); > if (bad) > return false; > > @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) > unsigned int loop; > > prefetchw(page); > - for (loop = 0; loop < nr_pages; loop++) { > + for (loop = 0; loop < nr_pages; ) { > struct page *p = &page[loop]; > > if (loop + 1 < nr_pages) > prefetchw(p + 1); > + > + if ((PageUninitialized2Mib(p)) && > + ((loop + PTRS_PER_PMD) > nr_pages)) > + ensure_page_is_initialized(p); > + > __ClearPageReserved(p); > set_page_count(p, 0); > + if (PageUninitialized2Mib(p)) > + loop += PTRS_PER_PMD; > + else > + loop += 1; > } > > page_zone(page)->managed_pages += 1 << order; > @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page, > area--; > high--; > size >>= 1; > + ensure_page_is_initialized(page); > VM_BUG_ON(bad_range(zone, &page[size])); > > #ifdef CONFIG_DEBUG_PAGEALLOC > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > for (i = 0; i < (1 << order); i++) { > struct page *p = page + i; > + > + if (PageUninitialized2Mib(p)) > + expand_page_initialization(page); > + > if (unlikely(check_new_page(p))) > return 1; > } > @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone, > unsigned long order; > int pages_moved = 0; > > + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page)); > #ifndef 
CONFIG_HOLES_IN_ZONE > /* > * page_zone is not safe to call in this context when > @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) > for (pfn = start_pfn; pfn < end_pfn; pfn++) { > if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) > return 1; > + > + if (PageUninitialized2Mib(pfn_to_page(pfn))) > + pfn += PTRS_PER_PMD; > } > return 0; > } > @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone) > } > } > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > + unsigned long size, int nid) why not use static ? it seems there is not outside user. > +{ > + unsigned long validate_end_pfn = pfn + size; > + > + if (pfn & (size - 1)) > + return 1; > + > + if (pfn + size >= end_pfn) > + return 1; > + > + while (pfn < validate_end_pfn) > + { > + if (!early_pfn_valid(pfn)) > + return 1; > + if (!early_pfn_in_nid(pfn, nid)) > + return 1; > + pfn++; > + } > + > + return size; > +} > + > /* > * Initially all pages are reserved - free ones are freed > * up by free_all_bootmem() once the early boot process is > @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > highest_memmap_pfn = end_pfn - 1; > > z = &NODE_DATA(nid)->node_zones[zone]; > - for (pfn = start_pfn; pfn < end_pfn; pfn++) { > + for (pfn = start_pfn; pfn < end_pfn; ) { > /* > * There can be holes in boot-time mem_map[]s > * handed to this function. They do not > * exist on hotplugged memory. > */ > + int pfns = 1; > if (context == MEMMAP_EARLY) { > - if (!early_pfn_valid(pfn)) > + if (!early_pfn_valid(pfn)) { > + pfn++; > continue; > - if (!early_pfn_in_nid(pfn, nid)) > + } > + if (!early_pfn_in_nid(pfn, nid)) { > + pfn++; > continue; > + } > + > + pfns = pfn_range_init_avail(pfn, end_pfn, > + PTRS_PER_PMD, nid); > } maybe could update memmap_init_zone() only iterate {memblock.memory} - {memblock.reserved}, so you do need to check avail .... 
as memmap_init_zone do not need to handle holes to mark reserve for them. > + > page = pfn_to_page(pfn); > __init_single_page(page, zone, nid, 1); > + > + if (pfns > 1) > + SetPageUninitialized2Mib(page); > + > + pfn += pfns; > } > } > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > {1UL << PG_owner_priv_1, "owner_priv_1" }, > {1UL << PG_arch_1, "arch_1" }, > {1UL << PG_reserved, "reserved" }, > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, PG_uninitialized_2m ? > {1UL << PG_private, "private" }, > {1UL << PG_private_2, "private_2" }, > {1UL << PG_writeback, "writeback" }, Yinghai From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751078Ab3GMEkF (ORCPT ); Sat, 13 Jul 2013 00:40:05 -0400 Received: from terminus.zytor.com ([198.137.202.10]:44348 "EHLO mail.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750775Ab3GMEkE (ORCPT ); Sat, 13 Jul 2013 00:40:04 -0400 Message-ID: <51E0DA05.4090107@zytor.com> Date: Fri, 12 Jul 2013 21:39:33 -0700 From: "H. Peter Anvin" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7 MIME-Version: 1.0 To: Yinghai Lu CC: Robin Holt , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> In-Reply-To: X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/12/2013 09:19 PM, Yinghai Lu wrote: >> PG_reserved, >> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ >> PG_private, /* If pagecache, has fs-private data */ The comment here is WTF... 
-hpa From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751267Ab3GMFbS (ORCPT ); Sat, 13 Jul 2013 01:31:18 -0400 Received: from mail-ie0-f180.google.com ([209.85.223.180]:51363 "EHLO mail-ie0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750971Ab3GMFbR (ORCPT ); Sat, 13 Jul 2013 01:31:17 -0400 MIME-Version: 1.0 In-Reply-To: <51E0DA05.4090107@zytor.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <51E0DA05.4090107@zytor.com> Date: Fri, 12 Jul 2013 22:31:16 -0700 X-Google-Sender-Auth: usIi7qxQjb90onbFx8jYv68T4F0 Message-ID: Subject: Re: [RFC 4/4] Sparse initialization of struct page array. From: Yinghai Lu To: "H. Peter Anvin" Cc: Robin Holt , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin wrote: > On 07/12/2013 09:19 PM, Yinghai Lu wrote: >>> PG_reserved, >>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ >>> PG_private, /* If pagecache, has fs-private data */ > > The comment here is WTF... ntz: Nate Zimmer? rmh: Robin Holt? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751358Ab3GMFjT (ORCPT ); Sat, 13 Jul 2013 01:39:19 -0400 Received: from terminus.zytor.com ([198.137.202.10]:44648 "EHLO mail.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750958Ab3GMFjS (ORCPT ); Sat, 13 Jul 2013 01:39:18 -0400 Message-ID: <51E0E7ED.7040801@zytor.com> Date: Fri, 12 Jul 2013 22:38:53 -0700 From: "H. 
Peter Anvin" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7 MIME-Version: 1.0 To: Yinghai Lu CC: Robin Holt , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <51E0DA05.4090107@zytor.com> In-Reply-To: X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/12/2013 10:31 PM, Yinghai Lu wrote: > On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin wrote: >> On 07/12/2013 09:19 PM, Yinghai Lu wrote: >>>> PG_reserved, >>>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ >>>> PG_private, /* If pagecache, has fs-private data */ >> >> The comment here is WTF... > > ntz: Nate Zimmer? > rmh: Robin Holt? > This kind of conversation doesn't really belong in a code comment, though. -hpa From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753801Ab3GOBiw (ORCPT ); Sun, 14 Jul 2013 21:38:52 -0400 Received: from mail-ie0-f171.google.com ([209.85.223.171]:50854 "EHLO mail-ie0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753762Ab3GOBiv (ORCPT ); Sun, 14 Jul 2013 21:38:51 -0400 Message-ID: <51E3529F.6070909@gmail.com> Date: Mon, 15 Jul 2013 09:38:39 +0800 From: Sam Ben User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130329 Thunderbird/17.0.5 MIME-Version: 1.0 To: Ingo Molnar CC: Borislav Petkov , Robin Holt , Robert Richter , "H. 
Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra Subject: Re: boot tracing References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> <20130712084712.GD24008@pd.tnic> <20130712085341.GC4328@gmail.com> In-Reply-To: <20130712085341.GC4328@gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/12/2013 04:53 PM, Ingo Molnar wrote: > * Borislav Petkov wrote: > >> On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: >>> Robert Richter and Boris Petkov are working on 'persistent events' >>> support for perf, which will eventually allow boot time profiling - >>> I'm not sure if the patches and the tooling support is ready enough >>> yet for your purposes. >> Nope, not yet but we're getting there. >> >>> Robert, Boris, the following workflow would be pretty intuitive: >>> >>> - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB >> What does perf=boot mean? I assume boot tracing. > In this case it would mean boot profiling - i.e. a cycles hardware-PMU > event collecting into a perf trace buffer as usual. > > Essentially a 'perf record -a' work-alike, just one that gets activated as > early as practical, and which would allow the profiling of memory > initialization. > > Now, one extra complication here is that to be able to profile buddy > allocator this persistent event would have to work before the buddy > allocator is active :-/ So this sort of profiling would have to use > memblock_alloc(). Could perf=boot be used to sample the performance of the memblock subsystem? I think the perf subsystem is initialized too late to monitor this. > > Just wanted to highlight this usecase, we might eventually want to support > it.
> > [ Note that this is different from boot tracing of one or more trace > events - but it's a conceptually pretty close cousin. ] > > Thanks, > > Ingo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753924Ab3GODTh (ORCPT ); Sun, 14 Jul 2013 23:19:37 -0400 Received: from relay3.sgi.com ([192.48.152.1]:52949 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1753871Ab3GODTf (ORCPT ); Sun, 14 Jul 2013 23:19:35 -0400 Date: Sun, 14 Jul 2013 22:19:33 -0500 From: Robin Holt To: Yinghai Lu Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 3/4] Seperate page initialization into a separate function. Message-ID: <20130715031932.GA31581@gulag1.americas.sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-4-git-send-email-holt@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 12, 2013 at 08:06:52PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote: > > Currently, memmap_init_zone() has all the smarts for initializing a > > single page. When we convert to initializing pages in a 2MiB chunk, > > we will need to do this equivalent work from two separate places > > so we are breaking out a helper function. > > > > Signed-off-by: Robin Holt > > Signed-off-by: Nate Zimmer > > To: "H. 
Peter Anvin" > > To: Ingo Molnar > > Cc: Linux Kernel > > Cc: Linux MM > > Cc: Rob Landley > > Cc: Mike Travis > > Cc: Daniel J Blueman > > Cc: Andrew Morton > > Cc: Greg KH > > Cc: Yinghai Lu > > Cc: Mel Gorman > > --- > > mm/mm_init.c | 2 +- > > mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------ > > 2 files changed, 45 insertions(+), 32 deletions(-) > > > > diff --git a/mm/mm_init.c b/mm/mm_init.c > > index c280a02..be8a539 100644 > > --- a/mm/mm_init.c > > +++ b/mm/mm_init.c > > @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) > > BUG_ON(or_mask != add_mask); > > } > > > > -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, > > +void mminit_verify_page_links(struct page *page, enum zone_type zone, > > unsigned long nid, unsigned long pfn) > > { > > BUG_ON(page_to_nid(page) != nid); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index c3edb62..635b131 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > > spin_unlock(&zone->lock); > > } > > > > +static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved) > > +{ > > + unsigned long pfn = page_to_pfn(page); > > + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > > + > > + set_page_links(page, zone, nid, pfn); > > + mminit_verify_page_links(page, zone, nid, pfn); > > + init_page_count(page); > > + page_mapcount_reset(page); > > + page_nid_reset_last(page); > > + if (reserved) { > > + SetPageReserved(page); > > + } else { > > + ClearPageReserved(page); > > + set_page_count(page, 0); > > + } > > + /* > > + * Mark the block movable so that blocks are reserved for > > + * movable at startup. This will force kernel allocations > > + * to reserve their blocks rather than leaking throughout > > + * the address space during boot when many long-lived > > + * kernel allocations are made. 
Later some blocks near > > + * the start are marked MIGRATE_RESERVE by > > + * setup_zone_migrate_reserve() > > + * > > + * bitmap is created for zone's valid pfn range. but memmap > > + * can be created for invalid pages (for alignment) > > + * check here not to call set_pageblock_migratetype() against > > + * pfn out of zone. > > + */ > > + if ((z->zone_start_pfn <= pfn) > > + && (pfn < zone_end_pfn(z)) > > + && !(pfn & (pageblock_nr_pages - 1))) > > + set_pageblock_migratetype(page, MIGRATE_MOVABLE); > > + > > + INIT_LIST_HEAD(&page->lru); > > +#ifdef WANT_PAGE_VIRTUAL > > + /* The shift won't overflow because ZONE_NORMAL is below 4G. */ > > + if (!is_highmem_idx(zone)) > > + set_page_address(page, __va(pfn << PAGE_SHIFT)); > > +#endif > > +} > > + > > static bool free_pages_prepare(struct page *page, unsigned int order) > > { > > int i; > > @@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > > continue; > > } > > page = pfn_to_page(pfn); > > - set_page_links(page, zone, nid, pfn); > > - mminit_verify_page_links(page, zone, nid, pfn); > > - init_page_count(page); > > - page_mapcount_reset(page); > > - page_nid_reset_last(page); > > - SetPageReserved(page); > > - /* > > - * Mark the block movable so that blocks are reserved for > > - * movable at startup. This will force kernel allocations > > - * to reserve their blocks rather than leaking throughout > > - * the address space during boot when many long-lived > > - * kernel allocations are made. Later some blocks near > > - * the start are marked MIGRATE_RESERVE by > > - * setup_zone_migrate_reserve() > > - * > > - * bitmap is created for zone's valid pfn range. but memmap > > - * can be created for invalid pages (for alignment) > > - * check here not to call set_pageblock_migratetype() against > > - * pfn out of zone. 
> > - */ > > - if ((z->zone_start_pfn <= pfn) > > - && (pfn < zone_end_pfn(z)) > > - && !(pfn & (pageblock_nr_pages - 1))) > > - set_pageblock_migratetype(page, MIGRATE_MOVABLE); > > - > > - INIT_LIST_HEAD(&page->lru); > > -#ifdef WANT_PAGE_VIRTUAL > > - /* The shift won't overflow because ZONE_NORMAL is below 4G. */ > > - if (!is_highmem_idx(zone)) > > - set_page_address(page, __va(pfn << PAGE_SHIFT)); > > -#endif > > + __init_single_page(page, zone, nid, 1); > > Can you > move page = pfn_to_page(pfn) into __init_single_page > and pass pfn directly? Sure, but then I don't care for the name so much, but I can think on that some too. I think the feedback I was most hoping to receive was pertaining to a means for removing the PG_uninitialized2Mib flag entirely. If I could get rid of that and have some page-local way of knowing if it has not been initialized, I think this patch set would be much better. Robin From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757274Ab3GOOIy (ORCPT ); Mon, 15 Jul 2013 10:08:54 -0400 Received: from relay1.sgi.com ([192.48.179.29]:54666 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757218Ab3GOOIw (ORCPT ); Mon, 15 Jul 2013 10:08:52 -0400 Message-ID: <51E40272.6000806@sgi.com> Date: Mon, 15 Jul 2013 09:08:50 -0500 From: Nathan Zimmer User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130623 Thunderbird/17.0.7 MIME-Version: 1.0 To: Yinghai Lu CC: "H. Peter Anvin" , Robin Holt , Ingo Molnar , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. 
References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <51E0DA05.4090107@zytor.com> In-Reply-To: Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [128.162.233.140] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/13/2013 12:31 AM, Yinghai Lu wrote: > On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin wrote: >> On 07/12/2013 09:19 PM, Yinghai Lu wrote: >>>> PG_reserved, >>>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ >>>> PG_private, /* If pagecache, has fs-private data */ >> The comment here is WTF... > ntz: Nate Zimmer? > rmh: Robin Holt? Yea that comment was supposed to be removed. Sorry about that. Nate From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932432Ab3GOPAm (ORCPT ); Mon, 15 Jul 2013 11:00:42 -0400 Received: from relay1.sgi.com ([192.48.179.29]:39606 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932092Ab3GOPAl (ORCPT ); Mon, 15 Jul 2013 11:00:41 -0400 Date: Mon, 15 Jul 2013 10:00:40 -0500 From: Robin Holt To: "H. 
Peter Anvin" , Ingo Molnar Cc: Robin Holt , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130715150040.GA3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jul 11, 2013 at 09:03:51PM -0500, Robin Holt wrote: > We have been working on this since we returned from shutdown and have > something to discuss now. We restricted ourselves to 2MiB initialization > to keep the patch set a little smaller and more clear. > > First, I think I want to propose getting rid of the page flag. If I knew > of a concrete way to determine that the page has not been initialized, > this patch series would look different. If there is no definitive > way to determine that the struct page has been initialized aside from > checking the entire page struct is zero, then I think I would suggest > we change the page flag to indicate the page has been initialized. Ingo or HPA, Did I implement this wrong or is there a way to get rid of the page flag which is not going to impact normal operation? I don't want to put too much more effort into this until I know we are stuck going this direction. Currently, the expand() function has a relatively expensive check against the 2MiB-aligned pfn's struct page. I do not know of a way to eliminate that check against the other page as the first reference we see for a page is in the middle of that 2MiB aligned range.
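The check against the 2MiB-aligned pfn described above can be modelled in userspace roughly as follows. This is only a sketch, not the patch's code: `struct toy_page`, `chunk_head_pfn()` and `toy_ensure_initialized()` are hypothetical names, and the chunk size of 512 pfns assumes x86_64 with 4KiB base pages (the patch itself uses PTRS_PER_PMD).

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4KiB base pages, so one 2MiB chunk spans 512 pfns. */
#define PFNS_PER_CHUNK 512UL

/* Toy stand-in for struct page (hypothetical, for illustration only). */
struct toy_page {
    bool uninit_chunk;  /* models PG_uninitialized2mib on the chunk head */
    bool initialized;   /* models "__init_single_page() has run" */
};

/* Round a pfn down to the head pfn of its 2MiB chunk. */
unsigned long chunk_head_pfn(unsigned long pfn)
{
    return pfn & ~(PFNS_PER_CHUNK - 1);
}

/*
 * Before using `pfn`, look at the head page of its chunk. The page being
 * referenced is usually in the middle of the chunk, so this touches a
 * different struct page than the one the caller cares about, which is
 * the cost discussed in the thread.
 *
 * Returns true if this call had to expand the chunk.
 */
bool toy_ensure_initialized(struct toy_page *memmap, unsigned long pfn)
{
    unsigned long head = chunk_head_pfn(pfn);
    unsigned long i;

    if (!memmap[head].uninit_chunk)
        return false;           /* chunk was already expanded */

    memmap[head].uninit_chunk = false;
    /* The head page was initialized at boot; fill in the tail pages. */
    for (i = head + 1; i < head + PFNS_PER_CHUNK; i++)
        memmap[i].initialized = true;
    return true;
}
```

In this model the flag lives only on the chunk head, so every lookup for a tail page pays for one extra struct page access even long after the whole chunk has been expanded.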
To identify this as an area of concern, we had booted with a simulator, setting watch points on the struct page array region once the Uninitialized flag was set and maintaining that until it was cleared. Thanks, Robin From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932620Ab3GOPQ0 (ORCPT ); Mon, 15 Jul 2013 11:16:26 -0400 Received: from relay2.sgi.com ([192.48.179.30]:47640 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S932364Ab3GOPQZ (ORCPT ); Mon, 15 Jul 2013 11:16:25 -0400 Date: Mon, 15 Jul 2013 10:16:23 -0500 From: Robin Holt To: Ingo Molnar Cc: Robin Holt , Borislav Petkov , Robert Richter , "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130715151623.GB3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130712082756.GA4328@gmail.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > > * Robin Holt wrote: > > > [...] > > > > With this patch, we did boot a 16TiB machine. Without the patches, the > > v3.10 kernel with the same configuration took 407 seconds for > > free_all_bootmem. With the patches and operating on 2MiB pages instead > > of 1GiB, it took 26 seconds so performance was improved. I have no feel > > for how the 1GiB chunk size will perform. > > That's pretty impressive. And WRONG! That is a 15x speedup in the freeing of memory at the free_all_bootmem point. 
That is _NOT_ the speedup from memmap_init_zone. I forgot to take that into account as Nate pointed out this morning in a hallway discussion. Before, on the 16TiB machine, memmap_init_zone took 1152 seconds. After, it took 50. If it were a straight 1/512th, we would have expected that 1152 to be something more on the line of 2-3 so there is still significant room for improvement. Sorry for the confusion. > It's still a 15x speedup instead of a 512x speedup, so I'd say there's > something else being the current bottleneck, besides page init > granularity. > > Can you boot with just a few gigs of RAM and stuff the rest into hotplug > memory, and then hot-add that memory? That would allow easy profiling of > remaining overhead. Nate and I will be working on other things for the next few hours hoping there is a better answer to the first question we asked about there being a way to test a page other than comparing against all zeroes to see if it has been initialized. Thanks, Robin From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752692Ab3GORp7 (ORCPT ); Mon, 15 Jul 2013 13:45:59 -0400 Received: from relay3.sgi.com ([192.48.152.1]:34635 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1752016Ab3GORp5 (ORCPT ); Mon, 15 Jul 2013 13:45:57 -0400 Date: Mon, 15 Jul 2013 12:45:55 -0500 From: Nathan Zimmer To: Yinghai Lu Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. 
Message-ID: <20130715174551.GA58640@asylum.americas.sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote: > > + > > page = pfn_to_page(pfn); > > __init_single_page(page, zone, nid, 1); > > + > > + if (pfns > 1) > > + SetPageUninitialized2Mib(page); > > + > > + pfn += pfns; > > } > > } > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > PG_uninitialized_2m ? > > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, > > Yinghai I hadn't actually been very happy with having a PG_uninitialized2mib flag. It implies if we want to jump to 1Gb pages we would need a second flag, PG_uninitialized1gb, for that. I was thinking of changing it to PG_uninitialized and setting page->private to the correct order. Thoughts? Nate From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753519Ab3GORzA (ORCPT ); Mon, 15 Jul 2013 13:55:00 -0400 Received: from terminus.zytor.com ([198.137.202.10]:43229 "EHLO mail.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752441Ab3GORy7 (ORCPT ); Mon, 15 Jul 2013 13:54:59 -0400 Message-ID: <51E4375E.1010704@zytor.com> Date: Mon, 15 Jul 2013 10:54:38 -0700 From: "H. 
Peter Anvin" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7 MIME-Version: 1.0 To: Nathan Zimmer CC: Yinghai Lu , Robin Holt , Ingo Molnar , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> In-Reply-To: <20130715174551.GA58640@asylum.americas.sgi.com> X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/15/2013 10:45 AM, Nathan Zimmer wrote: > > I hadn't actually been very happy with having a PG_uninitialized2mib flag. > It implies if we want to jump to 1Gb pages we would need a second flag, > PG_uninitialized1gb, for that. I was thinking of changing it to > PG_uninitialized and setting page->private to the correct order. > Thoughts? > Seems straightforward. The bigger issue is the amount of overhead we cause by having to check upstack for the initialization status of the superpages. I'm concerned, obviously, about lingering overhead that is "forever". That being said, in the absolutely worst case we could have a counter to the number of uninitialized pages which when it hits zero we do a static switch and switch out the initialization code (would have to be undone on memory hotplug, of course.) 
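A minimal userspace sketch of the counter idea above, under stated assumptions: the names `lazy_init_add_chunks()`, `lazy_init_chunk_done()` and `lazy_init_needed()` are hypothetical, C11 atomics stand in for whatever the kernel would actually use, and a plain boolean models the "static switch" (in the kernel this would more likely be a jump label / static key that patches the check out of the hot path).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Number of 2MiB chunks whose tail pages are still uninitialized. */
static atomic_long uninit_chunks;
/* Models the static switch: false means the slow-path check is disabled. */
static atomic_bool lazy_init_active;

/*
 * Called at boot, and again on memory hot-add, when `n` new chunks are
 * registered with only their head page initialized.
 * Assumption: callers serialize this against lazy_init_chunk_done();
 * the sketch is not race-free on its own.
 */
void lazy_init_add_chunks(long n)
{
    if (n <= 0)
        return;
    atomic_fetch_add(&uninit_chunks, n);
    atomic_store(&lazy_init_active, true);
}

/* Called once each time a chunk has been fully expanded. */
void lazy_init_chunk_done(void)
{
    /* atomic_fetch_sub returns the previous value: 1 means we hit zero. */
    if (atomic_fetch_sub(&uninit_chunks, 1) == 1)
        atomic_store(&lazy_init_active, false);
}

/* Hot-path test: is there any chance a page still needs expansion? */
bool lazy_init_needed(void)
{
    return atomic_load(&lazy_init_active);
}
```

Once the counter reaches zero the allocator would skip the per-chunk check entirely, and hot-adding memory re-arms it, which corresponds to the "would have to be undone on memory hotplug" caveat.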
-hpa From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754155Ab3GOS0S (ORCPT ); Mon, 15 Jul 2013 14:26:18 -0400 Received: from relay2.sgi.com ([192.48.179.30]:58793 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1753580Ab3GOS0R (ORCPT ); Mon, 15 Jul 2013 14:26:17 -0400 Date: Mon, 15 Jul 2013 13:26:15 -0500 From: Robin Holt To: "H. Peter Anvin" Cc: Nathan Zimmer , Yinghai Lu , Robin Holt , Ingo Molnar , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130715182615.GF3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51E4375E.1010704@zytor.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jul 15, 2013 at 10:54:38AM -0700, H. Peter Anvin wrote: > On 07/15/2013 10:45 AM, Nathan Zimmer wrote: > > > > I hadn't actually been very happy with having a PG_uninitialized2mib flag. > > It implies if we want to jump to 1Gb pages we would need a second flag, > > PG_uninitialized1gb, for that. I was thinking of changing it to > > PG_uninitialized and setting page->private to the correct order. > > Thoughts? > > > > Seems straightforward. The bigger issue is the amount of overhead we > cause by having to check upstack for the initialization status of the > superpages. > > I'm concerned, obviously, about lingering overhead that is "forever". 
> That being said, in the absolutely worst case we could have a counter to > the number of uninitialized pages which when it hits zero we do a static > switch and switch out the initialization code (would have to be undone > on memory hotplug, of course.) Is there a fairly cheap way to determine definitively that the struct page is not initialized? I think this patch set can change fairly drastically if we have that. I think I will start working up those changes and code a heavy-handed check until I hear of an alternative way to cheaply check. Thanks, Robin From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754044Ab3GOS3w (ORCPT ); Mon, 15 Jul 2013 14:29:52 -0400 Received: from terminus.zytor.com ([198.137.202.10]:43534 "EHLO mail.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753082Ab3GOS3v (ORCPT ); Mon, 15 Jul 2013 14:29:51 -0400 Message-ID: <51E43F91.1040906@zytor.com> Date: Mon, 15 Jul 2013 11:29:37 -0700 From: "H. Peter Anvin" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7 MIME-Version: 1.0 To: Robin Holt CC: Nathan Zimmer , Yinghai Lu , Ingo Molnar , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> In-Reply-To: <20130715182615.GF3421@sgi.com> X-Enigmail-Version: 1.5.1 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/15/2013 11:26 AM, Robin Holt wrote: > Is there a fairly cheap way to determine definitively that the struct > page is not initialized? 
By definition I would assume no. The only way I can think of would be to unmap the memory associated with the struct page in the TLB and initialize the struct pages at trap time. > I think this patch set can change fairly drastically if we have that. > I think I will start working up those changes and code a heavy-handed > check until I hear of an alternative way to cheaply check. -hpa From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757634Ab3GOVak (ORCPT ); Mon, 15 Jul 2013 17:30:40 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:49758 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755209Ab3GOVaj (ORCPT ); Mon, 15 Jul 2013 17:30:39 -0400 Date: Mon, 15 Jul 2013 14:30:37 -0700 From: Andrew Morton To: Robin Holt Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Greg KH , Yinghai Lu , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-Id: <20130715143037.8287ffbf2fb0e72bc8efb287@linux-foundation.org> In-Reply-To: <1373594635-131067-5-git-send-email-holt@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 11 Jul 2013 21:03:55 -0500 Robin Holt wrote: > During boot of large memory machines, a significant portion of boot > is spent initializing the struct page array. The vast majority of > those pages are not referenced during boot. > > Change this over to only initializing the pages when they are > actually allocated. 
> > Besides the advantage of boot speed, this allows us the chance to > use normal performance monitoring tools to determine where the bulk > of time is spent during page initialization. > > ... > > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > __free_page(page); > } > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > + > +static inline void __reserve_bootmem_page(struct page *page) > +{ > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > + phys_addr_t end = start + PAGE_SIZE; > + > + __reserve_bootmem_region(start, end); > +} It isn't obvious that this needed to be inlined? > static inline void free_reserved_page(struct page *page) > { > + __reserve_bootmem_page(page); > __free_reserved_page(page); > adjust_managed_page_count(page, 1); > } > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 6d53675..79e8eb7 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -83,6 +83,7 @@ enum pageflags { > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > PG_arch_1, > PG_reserved, > + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ "mib" creeps me out too. And it makes me think of SNMP, which I'd prefer not to think about. We've traditionally had fears of running out of page flags, but I've lost track of how close we are to that happening. IIRC the answer depends on whether you believe there is such a thing as a 32-bit NUMA system. Can this be avoided anyway? I suspect there's some idiotic combination of flags we could use to indicate the state. PG_reserved|PG_lru or something. "2MB" sounds terribly arch-specific. Shouldn't we make it more generic for when the hexagon64 port wants to use 4MB? That conversational code comment was already commented on, but it's still there? > > ... 
> > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > #endif > } > > +static void expand_page_initialization(struct page *basepage) > +{ > + unsigned long pfn = page_to_pfn(basepage); > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > + unsigned long zone = page_zonenum(basepage); > + int reserved = PageReserved(basepage); > + int nid = page_to_nid(basepage); > + > + ClearPageUninitialized2Mib(basepage); > + > + for( pfn++; pfn < end_pfn; pfn++ ) > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > +} > + > +void ensure_pages_are_initialized(unsigned long start_pfn, > + unsigned long end_pfn) I think this can be made static. I hope so, as it's a somewhat odd-sounding identifier for a global. > +{ > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > + unsigned long aligned_end_pfn; > + struct page *page; > + > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > + aligned_end_pfn += PTRS_PER_PMD; > + while (aligned_start_pfn < aligned_end_pfn) { > + if (pfn_valid(aligned_start_pfn)) { > + page = pfn_to_page(aligned_start_pfn); > + > + if(PageUninitialized2Mib(page)) checkpatch them, please. > + expand_page_initialization(page); > + } > + > + aligned_start_pfn += PTRS_PER_PMD; > + } > +} Some nice code comments for the above two functions would be helpful. > > ... > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > + unsigned long size, int nid) > +{ > + unsigned long validate_end_pfn = pfn + size; > + > + if (pfn & (size - 1)) > + return 1; > + > + if (pfn + size >= end_pfn) > + return 1; > + > + while (pfn < validate_end_pfn) > + { > + if (!early_pfn_valid(pfn)) > + return 1; > + if (!early_pfn_in_nid(pfn, nid)) > + return 1; > + pfn++; > + } > + > + return size; > +} Document it, please. The return value semantics look odd, so don't forget to explain all that as well. > > ... 
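One possible way to write down the return-value semantics requested above, as a userspace model of the quoted pfn_range_init_avail(). The logic follows the quoted hunk; everything else is an assumption made for illustration: `toy_pfn_valid()` and `toy_pfn_in_nid()` are hypothetical stand-ins for early_pfn_valid() and early_pfn_in_nid(), the hole at pfns 1536..1539 is invented, and the return type is widened to unsigned long.

```c
#include <stdbool.h>

/* Hypothetical stand-in: pretend there is a memory hole at [1536, 1540). */
static bool toy_pfn_valid(unsigned long pfn)
{
    return pfn < 1536 || pfn >= 1540;
}

/* Hypothetical stand-in: pretend every pfn belongs to node 0. */
static bool toy_pfn_in_nid(unsigned long pfn, int nid)
{
    (void)pfn;
    return nid == 0;
}

/*
 * toy_pfn_range_init_avail - can [pfn, pfn + size) be initialized lazily?
 * @pfn:     first pfn of the candidate chunk
 * @end_pfn: end of the range being set up (exclusive)
 * @size:    chunk size in pfns; must be a power of two
 * @nid:     node the chunk is expected to live on
 *
 * Returns the number of pfns the caller should advance by:
 *   size  the chunk is size-aligned, ends strictly before end_pfn, and
 *         every pfn in it is valid and on @nid; only the head page needs
 *         initializing now and the chunk can be flagged for later
 *   1     otherwise; initialize just this one page and retry at pfn + 1
 *
 * Note: as in the quoted hunk, a chunk ending exactly at end_pfn
 * (pfn + size == end_pfn) is rejected and handled page by page.
 */
unsigned long toy_pfn_range_init_avail(unsigned long pfn,
                                       unsigned long end_pfn,
                                       unsigned long size, int nid)
{
    unsigned long chunk_end = pfn + size;

    if (pfn & (size - 1))
        return 1;               /* not aligned to the chunk size */

    if (chunk_end >= end_pfn)
        return 1;               /* chunk would reach the end of the range */

    for (; pfn < chunk_end; pfn++) {
        if (!toy_pfn_valid(pfn))
            return 1;           /* hole inside the chunk */
        if (!toy_pfn_in_nid(pfn, nid))
            return 1;           /* chunk crosses a node boundary */
    }

    return size;
}
```

Read this way, the caller's `if (pfns > 1)` test in memmap_init_zone() means "the whole chunk was deferred", and `pfn += pfns` steps over either one page or one chunk.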
> > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > {1UL << PG_owner_priv_1, "owner_priv_1" }, > {1UL << PG_arch_1, "arch_1" }, > {1UL << PG_reserved, "reserved" }, > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, It would be better if the name which is visible in procfs matches the name in the kernel source code. > {1UL << PG_private, "private" }, > {1UL << PG_private_2, "private_2" }, > {1UL << PG_writeback, "writeback" }, From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753939Ab3GPIzE (ORCPT ); Tue, 16 Jul 2013 04:55:04 -0400 Received: from LGEMRELSE6Q.lge.com ([156.147.1.121]:50907 "EHLO LGEMRELSE6Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753351Ab3GPIzA (ORCPT ); Tue, 16 Jul 2013 04:55:00 -0400 X-AuditID: 9c930179-b7c49ae000000e68-91-51e50a62f567 Date: Tue, 16 Jul 2013 17:55:02 +0900 From: Joonsoo Kim To: Ingo Molnar Cc: Robin Holt , Borislav Petkov , Robert Richter , "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130716085502.GA31276@lge.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130712082756.GA4328@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > > * Robin Holt wrote: > > > [...] > > > > With this patch, we did boot a 16TiB machine. 
Without the patches, the > > v3.10 kernel with the same configuration took 407 seconds for > > free_all_bootmem. With the patches and operating on 2MiB pages instead > > of 1GiB, it took 26 seconds so performance was improved. I have no feel > > for how the 1GiB chunk size will perform. > > That's pretty impressive. > > It's still a 15x speedup instead of a 512x speedup, so I'd say there's > something else being the current bottleneck, besides page init > granularity. > > Can you boot with just a few gigs of RAM and stuff the rest into hotplug > memory, and then hot-add that memory? That would allow easy profiling of > remaining overhead. > > Side note: > > Robert Richter and Boris Petkov are working on 'persistent events' support > for perf, which will eventually allow boot time profiling - I'm not sure > if the patches and the tooling support is ready enough yet for your > purposes. > > Robert, Boris, the following workflow would be pretty intuitive: > > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > > - we'd get a single (cycles?) event running once the perf subsystem is up > and running, with a sampling frequency of 1 KHz, sending profiling > trace events to a sufficiently sized profiling buffer of 16 MB per > CPU. > > - once the system reaches SYSTEM_RUNNING, profiling is stopped either > automatically - or the user stops it via a new tooling command. > > - the profiling buffer is extracted into a regular perf.data via a > special 'perf record' call or some other, new perf tooling > solution/variant. > > [ Alternatively the kernel could attempt to construct a 'virtual' > perf.data from the persistent buffer, available via /sys/debug or > elsewhere in /sys - just like the kernel constructs a 'virtual' > /proc/kcore, etc. That file could be copied or used directly. ] Hello, Robert, Boris, Ingo. How about executing a perf in usermodehelper and collecting output in tmpfs? 
Using this approach, we can start a perf after rootfs initialization, because we need at least a perf binary. But we can use almost all of perf's functionality. If anyone is interested in this approach, I will send patches implementing this idea. Thanks. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754768Ab3GPJIc (ORCPT ); Tue, 16 Jul 2013 05:08:32 -0400 Received: from mail.skyhub.de ([78.46.96.112]:56676 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753882Ab3GPJIa (ORCPT ); Tue, 16 Jul 2013 05:08:30 -0400 Date: Tue, 16 Jul 2013 11:08:05 +0200 From: Borislav Petkov To: Joonsoo Kim Cc: Ingo Molnar , Robin Holt , Robert Richter , "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130716090805.GC4402@pd.tnic> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> <20130716085502.GA31276@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20130716085502.GA31276@lge.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 16, 2013 at 05:55:02PM +0900, Joonsoo Kim wrote: > How about executing a perf in usermodehelper and collecting output > in tmpfs? Using this approach, we can start a perf after rootfs > initialization, What for if we can start logging to buffers much earlier? *Reading* from those buffers can be done much later, at our own leisure with full userspace up. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine.
-- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754612Ab3GPK0W (ORCPT ); Tue, 16 Jul 2013 06:26:22 -0400 Received: from relay3.sgi.com ([192.48.152.1]:43227 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1753355Ab3GPK0T (ORCPT ); Tue, 16 Jul 2013 06:26:19 -0400 Date: Tue, 16 Jul 2013 05:26:15 -0500 From: Robin Holt To: Yinghai Lu Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130716102615.GG3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote: > On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote: > > During boot of large memory machines, a significant portion of boot > > is spent initializing the struct page array. The vast majority of > > those pages are not referenced during boot. > > > > Change this over to only initializing the pages when they are > > actually allocated. > > > > Besides the advantage of boot speed, this allows us the chance to > > use normal performance monitoring tools to determine where the bulk > > of time is spent during page initialization. > > > > Signed-off-by: Robin Holt > > Signed-off-by: Nate Zimmer > > To: "H. 
Peter Anvin" > > To: Ingo Molnar > > Cc: Linux Kernel > > Cc: Linux MM > > Cc: Rob Landley > > Cc: Mike Travis > > Cc: Daniel J Blueman > > Cc: Andrew Morton > > Cc: Greg KH > > Cc: Yinghai Lu > > Cc: Mel Gorman > > --- > > include/linux/mm.h | 11 +++++ > > include/linux/page-flags.h | 5 +- > > mm/nobootmem.c | 5 ++ > > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- > > 4 files changed, 132 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index e0c8528..3de08b5 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > > __free_page(page); > > } > > > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > + > > +static inline void __reserve_bootmem_page(struct page *page) > > +{ > > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > > + phys_addr_t end = start + PAGE_SIZE; > > + > > + __reserve_bootmem_region(start, end); > > +} > > + > > static inline void free_reserved_page(struct page *page) > > { > > + __reserve_bootmem_page(page); > > __free_reserved_page(page); > > adjust_managed_page_count(page, 1); > > } > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 6d53675..79e8eb7 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -83,6 +83,7 @@ enum pageflags { > > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > > PG_arch_1, > > PG_reserved, > > + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ > > PG_private, /* If pagecache, has fs-private data */ > > PG_private_2, /* If pagecache, has fs aux data */ > > PG_writeback, /* Page is under writeback */ > > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) > > > > __PAGEFLAG(SlobFree, slob_free) > > > > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) > > + > > /* > > * Private page markings that may be used by the filesystem that owns the page > > * for its own purposes. > > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) > > #define PAGE_FLAGS_CHECK_AT_FREE \ > > (1 << PG_lru | 1 << PG_locked | \ > > 1 << PG_private | 1 << PG_private_2 | \ > > - 1 << PG_writeback | 1 << PG_reserved | \ > > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ > > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ > > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ > > __PG_COMPOUND_LOCK) > > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > > index 3b512ca..e3a386d 100644 > > --- a/mm/nobootmem.c > > +++ b/mm/nobootmem.c > > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) > > phys_addr_t start, end, size; > > u64 i; > > > > + for_each_reserved_mem_region(i, &start, &end) > > + __reserve_bootmem_region(start, end); > > + > > How about holes that is not in memblock.reserved? > > before this patch: > free_area_init_node/free_area_init_core/memmap_init_zone > will mark all page in node range to Reserved in struct page, at first. > > but those holes is not mapped via kernel low mapping. > so it should be ok not touch "struct page" for them. > > Now you only mark reserved for memblock.reserved at first, and later > mark {memblock.memory} - { memblock.reserved} to be available. > And that is ok. > > but should split that change to another patch and add some comment > and change log for the change. > in case there is some user like UEFI etc do weird thing. 
I will split out a separate patch for this. > > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > > count += __free_memory_core(start, end); > > > > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) > > { > > struct pglist_data *pgdat; > > > > + memblock_dump_all(); > > + > > Not needed. Left over debug in the rush to ask our question. > > for_each_online_pgdat(pgdat) > > reset_node_lowmem_managed_pages(pgdat); > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 635b131..fe51eb9 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > > #endif > > } > > > > +static void expand_page_initialization(struct page *basepage) > > +{ > > + unsigned long pfn = page_to_pfn(basepage); > > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > > + unsigned long zone = page_zonenum(basepage); > > + int reserved = PageReserved(basepage); > > + int nid = page_to_nid(basepage); > > + > > + ClearPageUninitialized2Mib(basepage); > > + > > + for( pfn++; pfn < end_pfn; pfn++ ) > > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > > +} > > + > > +void ensure_pages_are_initialized(unsigned long start_pfn, > > + unsigned long end_pfn) > > +{ > > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > > + unsigned long aligned_end_pfn; > > + struct page *page; > > + > > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > > + aligned_end_pfn += PTRS_PER_PMD; > > + while (aligned_start_pfn < aligned_end_pfn) { > > + if (pfn_valid(aligned_start_pfn)) { > > + page = pfn_to_page(aligned_start_pfn); > > + > > + if(PageUninitialized2Mib(page)) > > + expand_page_initialization(page); > > + } > > + > > + aligned_start_pfn += PTRS_PER_PMD; > > + } > > +} > > + > > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > > +{ > > + unsigned long start_pfn = PFN_DOWN(start); > > + unsigned long end_pfn = PFN_UP(end); > 
> + > > + ensure_pages_are_initialized(start_pfn, end_pfn); > > +} > > that name is confusing, actually it is setting to struct page to Reserved only. > maybe __reserve_pages_bootmem() to be aligned to free_pages_bootmem ? Done. > > + > > +static inline void ensure_page_is_initialized(struct page *page) > > +{ > > + __reserve_bootmem_page(page); > > +} > > how about use __reserve_page_bootmem directly and add comment in callers site ? I really dislike that. The inline function makes the need for a comment unnecessary in my opinion and leaves the implementation localized to the one-line function. Those wanting to understand why can quickly see that the intended functionality is accomplished by the other function. I would really like to leave this as-is. > > + > > static bool free_pages_prepare(struct page *page, unsigned int order) > > { > > int i; > > @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) > > if (PageAnon(page)) > > page->mapping = NULL; > > for (i = 0; i < (1 << order); i++) > > - bad += free_pages_check(page + i); > > + if (PageUninitialized2Mib(page + i)) > > + i += PTRS_PER_PMD - 1; > > + else > > + bad += free_pages_check(page + i); > > if (bad) > > return false; > > > > @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) > > unsigned int loop; > > > > prefetchw(page); > > - for (loop = 0; loop < nr_pages; loop++) { > > + for (loop = 0; loop < nr_pages; ) { > > struct page *p = &page[loop]; > > > > if (loop + 1 < nr_pages) > > prefetchw(p + 1); > > + > > + if ((PageUninitialized2Mib(p)) && > > + ((loop + PTRS_PER_PMD) > nr_pages)) > > + ensure_page_is_initialized(p); > > + > > __ClearPageReserved(p); > > set_page_count(p, 0); > > + if (PageUninitialized2Mib(p)) > > + loop += PTRS_PER_PMD; > > + else > > + loop += 1; > > } > > > > page_zone(page)->managed_pages += 1 << order; > > @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page, > 
> area--; > > high--; > > size >>= 1; > > + ensure_page_is_initialized(page); > > VM_BUG_ON(bad_range(zone, &page[size])); > > > > #ifdef CONFIG_DEBUG_PAGEALLOC > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > for (i = 0; i < (1 << order); i++) { > > struct page *p = page + i; > > + > > + if (PageUninitialized2Mib(p)) > > + expand_page_initialization(page); > > + > > if (unlikely(check_new_page(p))) > > return 1; > > } > > @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone, > > unsigned long order; > > int pages_moved = 0; > > > > + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page)); > > #ifndef CONFIG_HOLES_IN_ZONE > > /* > > * page_zone is not safe to call in this context when > > @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) > > for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) > > return 1; > > + > > + if (PageUninitialized2Mib(pfn_to_page(pfn))) > > + pfn += PTRS_PER_PMD; > > } > > return 0; > > } > > @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone) > > } > > } > > > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > > + unsigned long size, int nid) > why not use static ? it seems there is not outside user. Left over from early patch series where we were using this from mm/nobootmem.c. Fixed. 
> > +{ > > + unsigned long validate_end_pfn = pfn + size; > > + > > + if (pfn & (size - 1)) > > + return 1; > > + > > + if (pfn + size >= end_pfn) > > + return 1; > > + > > + while (pfn < validate_end_pfn) > > + { > > + if (!early_pfn_valid(pfn)) > > + return 1; > > + if (!early_pfn_in_nid(pfn, nid)) > > + return 1; > > + pfn++; > > + } > > + > > + return size; > > +} > > + > > /* > > * Initially all pages are reserved - free ones are freed > > * up by free_all_bootmem() once the early boot process is > > @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > > highest_memmap_pfn = end_pfn - 1; > > > > z = &NODE_DATA(nid)->node_zones[zone]; > > - for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > + for (pfn = start_pfn; pfn < end_pfn; ) { > > /* > > * There can be holes in boot-time mem_map[]s > > * handed to this function. They do not > > * exist on hotplugged memory. > > */ > > + int pfns = 1; > > if (context == MEMMAP_EARLY) { > > - if (!early_pfn_valid(pfn)) > > + if (!early_pfn_valid(pfn)) { > > + pfn++; > > continue; > > - if (!early_pfn_in_nid(pfn, nid)) > > + } > > + if (!early_pfn_in_nid(pfn, nid)) { > > + pfn++; > > continue; > > + } > > + > > + pfns = pfn_range_init_avail(pfn, end_pfn, > > + PTRS_PER_PMD, nid); > > } > > maybe could update memmap_init_zone() only iterate {memblock.memory} - > {memblock.reserved}, so you do need to check avail .... > > as memmap_init_zone do not need to handle holes to mark reserve for them. Maybe I can change pfn_range_init_avail in such a way that the __reserve_pages_bootmem() work above is not needed. I will dig into that some more before the next patch submission. 
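The return convention of pfn_range_init_avail() discussed above can be illustrated with a small user-space sketch. This is not the kernel code. The stub predicates, the names pfns_to_advance() and count_eager_inits(), and CHUNK_PFNS (assumed here to be 512, the x86_64 value of PTRS_PER_PMD with 4KiB pages) are hypothetical stand-ins chosen for illustration. The sketch mirrors the posted logic: return the chunk size when the whole chunk can be left for later initialization, otherwise return 1 so the caller advances a single pfn.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 512 pfns per chunk (PTRS_PER_PMD on x86_64 with 4KiB pages). */
#define CHUNK_PFNS 512UL

/* Hypothetical stand-in for early_pfn_valid(): pfns below 4096 exist. */
static bool stub_pfn_valid(unsigned long pfn)
{
	return pfn < 4096;
}

/* Hypothetical stand-in for early_pfn_in_nid(): all memory is on node 0. */
static bool stub_pfn_in_nid(unsigned long pfn, int nid)
{
	(void)pfn;
	return nid == 0;
}

/*
 * Mirrors the posted pfn_range_init_avail(): returns `size` when the
 * chunk starting at `pfn` is aligned, ends before `end_pfn`, and is
 * fully valid on `nid`; returns 1 in every other case.
 */
static unsigned long pfns_to_advance(unsigned long pfn, unsigned long end_pfn,
				     unsigned long size, int nid)
{
	unsigned long p;

	assert(size && !(size & (size - 1)));	/* size must be a power of two */

	if (pfn & (size - 1))
		return 1;
	if (pfn + size >= end_pfn)
		return 1;

	for (p = pfn; p < pfn + size; p++)
		if (!stub_pfn_valid(p) || !stub_pfn_in_nid(p, nid))
			return 1;

	return size;
}

/*
 * Mirrors the shape of the modified memmap_init_zone() loop and counts
 * how many struct pages would be initialized eagerly.
 */
static unsigned long count_eager_inits(unsigned long start_pfn,
				       unsigned long end_pfn, int nid)
{
	unsigned long pfn = start_pfn;
	unsigned long eager = 0;

	while (pfn < end_pfn) {
		if (!stub_pfn_valid(pfn) || !stub_pfn_in_nid(pfn, nid)) {
			pfn++;
			continue;
		}
		eager++;	/* one __init_single_page() call in the real code */
		pfn += pfns_to_advance(pfn, end_pfn, CHUNK_PFNS, nid);
	}
	return eager;
}
```

With these stubs, count_eager_inits(0, 2048, 0) touches 515 struct pages: three chunk heads plus all 512 pages of the last chunk, because the posted `pfn + size >= end_pfn` test rejects a chunk that ends exactly at end_pfn.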
> > > + > > page = pfn_to_page(pfn); > > __init_single_page(page, zone, nid, 1); > > + > > + if (pfns > 1) > > + SetPageUninitialized2Mib(page); > > + > > + pfn += pfns; > > } > > } > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > PG_uninitialized_2m ? Done. > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, > > Yinghai From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754913Ab3GPKjA (ORCPT ); Tue, 16 Jul 2013 06:39:00 -0400 Received: from relay2.sgi.com ([192.48.179.30]:46903 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1754177Ab3GPKi7 (ORCPT ); Tue, 16 Jul 2013 06:38:59 -0400 Date: Tue, 16 Jul 2013 05:38:58 -0500 From: Robin Holt To: Andrew Morton Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Greg KH , Yinghai Lu , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. 
Message-ID: <20130716103857.GH3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715143037.8287ffbf2fb0e72bc8efb287@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130715143037.8287ffbf2fb0e72bc8efb287@linux-foundation.org> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jul 15, 2013 at 02:30:37PM -0700, Andrew Morton wrote: > On Thu, 11 Jul 2013 21:03:55 -0500 Robin Holt wrote: > > > During boot of large memory machines, a significant portion of boot > > is spent initializing the struct page array. The vast majority of > > those pages are not referenced during boot. > > > > Change this over to only initializing the pages when they are > > actually allocated. > > > > Besides the advantage of boot speed, this allows us the chance to > > use normal performance monitoring tools to determine where the bulk > > of time is spent during page initialization. > > > > ... > > > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > > __free_page(page); > > } > > > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > + > > +static inline void __reserve_bootmem_page(struct page *page) > > +{ > > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > > + phys_addr_t end = start + PAGE_SIZE; > > + > > + __reserve_bootmem_region(start, end); > > +} > > It isn't obvious that this needed to be inlined? It is being declared in a header file. All the other functions I came across in that header file are declared as inline (or __always_inline). It feels to me like this is right. Can I leave it as-is? 
> > > static inline void free_reserved_page(struct page *page) > > { > > + __reserve_bootmem_page(page); > > __free_reserved_page(page); > > adjust_managed_page_count(page, 1); > > } > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 6d53675..79e8eb7 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -83,6 +83,7 @@ enum pageflags { > > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > > PG_arch_1, > > PG_reserved, > > + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */ > > "mib" creeps me out too. And it makes me think of SNMP, which I'd > prefer not to think about. > > We've traditionally had fears of running out of page flags, but I've > lost track of how close we are to that happening. IIRC the answer > depends on whether you believe there is such a thing as a 32-bit NUMA > system. > > Can this be avoided anyway? I suspect there's some idiotic combination > of flags we could use to indicate the state. PG_reserved|PG_lru or > something. > > "2MB" sounds terribly arch-specific. Shouldn't we make it more generic > for when the hexagon64 port wants to use 4MB? > > That conversational code comment was already commented on, but it's > still there? I am going to work on making it non-2m based over the course of this week, so expect the _2m (current name based on Yinghai's comments) to go away entirely. > > > > ... 
> > > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > > #endif > > } > > > > +static void expand_page_initialization(struct page *basepage) > > +{ > > + unsigned long pfn = page_to_pfn(basepage); > > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > > + unsigned long zone = page_zonenum(basepage); > > + int reserved = PageReserved(basepage); > > + int nid = page_to_nid(basepage); > > + > > + ClearPageUninitialized2Mib(basepage); > > + > > + for( pfn++; pfn < end_pfn; pfn++ ) > > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > > +} > > + > > +void ensure_pages_are_initialized(unsigned long start_pfn, > > + unsigned long end_pfn) > > I think this can be made static. I hope so, as it's a somewhat > odd-sounding identifier for a global. Done. > > +{ > > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > > + unsigned long aligned_end_pfn; > > + struct page *page; > > + > > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > > + aligned_end_pfn += PTRS_PER_PMD; > > + while (aligned_start_pfn < aligned_end_pfn) { > > + if (pfn_valid(aligned_start_pfn)) { > > + page = pfn_to_page(aligned_start_pfn); > > + > > + if(PageUninitialized2Mib(page)) > > checkpatch them, please. Will certainly do. > > + expand_page_initialization(page); > > + } > > + > > + aligned_start_pfn += PTRS_PER_PMD; > > + } > > +} > > Some nice code comments for the above two functions would be helpful. Will do. > > > > ... 
> > > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > > + unsigned long size, int nid) > > +{ > > + unsigned long validate_end_pfn = pfn + size; > > + > > + if (pfn & (size - 1)) > > + return 1; > > + > > + if (pfn + size >= end_pfn) > > + return 1; > > + > > + while (pfn < validate_end_pfn) > > + { > > + if (!early_pfn_valid(pfn)) > > + return 1; > > + if (!early_pfn_in_nid(pfn, nid)) > > + return 1; > > + pfn++; > > + } > > + > > + return size; > > +} > > Document it, please. The return value semantics look odd, so don't > forget to explain all that as well. Will do. Will also work on the name to make it more clear what we are returning. > > > > ... > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > It would be better if the name which is visible in procfs matches the > name in the kernel source code. Done and will try to maintain the consistency. > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, Robin From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932660Ab3GPNDF (ORCPT ); Tue, 16 Jul 2013 09:03:05 -0400 Received: from mail-pd0-f171.google.com ([209.85.192.171]:39878 "EHLO mail-pd0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752072Ab3GPNDD (ORCPT ); Tue, 16 Jul 2013 09:03:03 -0400 Message-ID: <51E5447D.70901@gmail.com> Date: Tue, 16 Jul 2013 21:02:53 +0800 From: Sam Ben User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130329 Thunderbird/17.0.5 MIME-Version: 1.0 To: Robin Holt CC: "H. 
Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks. References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-3-git-send-email-holt@sgi.com> In-Reply-To: <1373594635-131067-3-git-send-email-holt@sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Robin, On 07/12/2013 10:03 AM, Robin Holt wrote: > Currently, when free_all_bootmem() calls __free_pages_memory(), the > number of contiguous pages that __free_pages_memory() passes to the > buddy allocator is limited to BITS_PER_LONG. In order to be able to I don't understand this. Why was the number of pages originally limited to BITS_PER_LONG? > free only the first page of a 2MiB chunk, we need that to be increased > to PTRS_PER_PMD. > > Signed-off-by: Robin Holt > Signed-off-by: Nate Zimmer > To: "H.
Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > --- > mm/nobootmem.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index bdd3fa2..3b512ca 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) > static void __init __free_pages_memory(unsigned long start, unsigned long end) > { > unsigned long i, start_aligned, end_aligned; > - int order = ilog2(BITS_PER_LONG); > + int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD)); > > - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1); > - end_aligned = end & ~(BITS_PER_LONG - 1); > + start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1); > + end_aligned = end & ~((1UL << order) - 1); > > if (end_aligned <= start_aligned) { > for (i = start; i < end; i++) > @@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end) > for (i = start; i < start_aligned; i++) > __free_pages_bootmem(pfn_to_page(i), 0); > > - for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG) > + for (i = start_aligned; i < end_aligned; i += 1 << order) > __free_pages_bootmem(pfn_to_page(i), order); > > for (i = end_aligned; i < end; i++) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751556Ab3GQFSC (ORCPT ); Wed, 17 Jul 2013 01:18:02 -0400 Received: from mail-ob0-f170.google.com ([209.85.214.170]:44665 "EHLO mail-ob0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751149Ab3GQFSA (ORCPT ); Wed, 17 Jul 2013 01:18:00 -0400 Message-ID: <51E628F8.6030303@gmail.com> Date: Wed, 17 Jul 2013 13:17:44 +0800 From: Sam Ben User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130329 
Thunderbird/17.0.5 MIME-Version: 1.0 To: Robin Holt CC: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1373594635-131067-1-git-send-email-holt@sgi.com> In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/12/2013 10:03 AM, Robin Holt wrote: > We have been working on this since we returned from shutdown and have > something to discuss now. We restricted ourselves to 2MiB initialization > to keep the patch set a little smaller and more clear. > > First, I think I want to propose getting rid of the page flag. If I knew > of a concrete way to determine that the page has not been initialized, > this patch series would look different. If there is no definitive > way to determine that the struct page has been initialized aside from > checking the entire page struct is zero, then I think I would suggest > we change the page flag to indicate the page has been initialized. > > The heart of the problem as I see it comes from expand(). We nearly > always see a first reference to a struct page which is in the middle > of the 2MiB region. Due to that access, the unlikely() check that was > originally proposed really ends up referencing a different page entirely. > We actually did not introduce an unlikely and refactor the patches to > make that unlikely inside a static inline function. Also, given the > strong warning at the head of expand(), we did not feel experienced > enough to refactor it to make things always reference the 2MiB page > first. > > With this patch, we did boot a 16TiB machine. 
Without the patches, > the v3.10 kernel with the same configuration took 407 seconds for > free_all_bootmem. With the patches and operating on 2MiB pages instead > of 1GiB, it took 26 seconds so performance was improved. I have no feel > for how the 1GiB chunk size will perform. How can I measure how much time is spent in free_all_bootmem? > > I am on vacation for the next three days so I am sorry in advance for > my infrequent or non-existant responses. > > > Signed-off-by: Robin Holt > Signed-off-by: Nate Zimmer > To: "H. Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753416Ab3GQJaz (ORCPT ); Wed, 17 Jul 2013 05:30:55 -0400 Received: from relay3.sgi.com ([192.48.152.1]:36061 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1752431Ab3GQJay (ORCPT ); Wed, 17 Jul 2013 05:30:54 -0400 Date: Wed, 17 Jul 2013 04:30:52 -0500 From: Robin Holt To: Sam Ben Cc: Robin Holt , "H.
Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130717093051.GK3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <51E628F8.6030303@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51E628F8.6030303@gmail.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote: > On 07/12/2013 10:03 AM, Robin Holt wrote: > >We have been working on this since we returned from shutdown and have > >something to discuss now. We restricted ourselves to 2MiB initialization > >to keep the patch set a little smaller and more clear. > > > >First, I think I want to propose getting rid of the page flag. If I knew > >of a concrete way to determine that the page has not been initialized, > >this patch series would look different. If there is no definitive > >way to determine that the struct page has been initialized aside from > >checking the entire page struct is zero, then I think I would suggest > >we change the page flag to indicate the page has been initialized. > > > >The heart of the problem as I see it comes from expand(). We nearly > >always see a first reference to a struct page which is in the middle > >of the 2MiB region. Due to that access, the unlikely() check that was > >originally proposed really ends up referencing a different page entirely. > >We actually did not introduce an unlikely and refactor the patches to > >make that unlikely inside a static inline function. 
Also, given the
> >strong warning at the head of expand(), we did not feel experienced
> >enough to refactor it to make things always reference the 2MiB page
> >first.
> >
> >With this patch, we did boot a 16TiB machine. Without the patches,
> >the v3.10 kernel with the same configuration took 407 seconds for
> >free_all_bootmem. With the patches and operating on 2MiB pages instead
> >of 1GiB, it took 26 seconds so performance was improved. I have no feel
> >for how the 1GiB chunk size will perform.
>
> How to test how much time spend on free_all_bootmem?

We had put a pr_emerg at the beginning and end of free_all_bootmem and
then used a modified version of script which records the time in
microseconds at the beginning of each line of output.

Robin

>
> >
> >I am on vacation for the next three days so I am sorry in advance for
> >my infrequent or non-existent responses.
> >
> >
> >Signed-off-by: Robin Holt
> >Signed-off-by: Nate Zimmer
> >To: "H. Peter Anvin"
> >To: Ingo Molnar
> >Cc: Linux Kernel
> >Cc: Linux MM
> >Cc: Rob Landley
> >Cc: Mike Travis
> >Cc: Daniel J Blueman
> >Cc: Andrew Morton
> >Cc: Greg KH
> >Cc: Yinghai Lu
> >Cc: Mel Gorman
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at http://vger.kernel.org/majordomo-info.html
> >Please read the FAQ at http://www.tux.org/lkml/

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756668Ab3GVGNq (ORCPT ); Mon, 22 Jul 2013 02:13:46 -0400
Received: from relay3.sgi.com ([192.48.152.1]:32809 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1756314Ab3GVGNn (ORCPT ); Mon, 22 Jul 2013 02:13:43 -0400
Date: Mon, 22 Jul 2013 01:13:39 -0500
From: Robin Holt
To: Yinghai Lu
Cc: Robin Holt , Sam Ben , "H.
Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130722061339.GC3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <51E628F8.6030303@gmail.com> <20130717093051.GK3421@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 19, 2013 at 04:51:49PM -0700, Yinghai Lu wrote: > On Wed, Jul 17, 2013 at 2:30 AM, Robin Holt wrote: > > On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote: > >> >With this patch, we did boot a 16TiB machine. Without the patches, > >> >the v3.10 kernel with the same configuration took 407 seconds for > >> >free_all_bootmem. With the patches and operating on 2MiB pages instead > >> >of 1GiB, it took 26 seconds so performance was improved. I have no feel > >> >for how the 1GiB chunk size will perform. > >> > >> How to test how much time spend on free_all_bootmem? > > > > We had put a pr_emerg at the beginning and end of free_all_bootmem and > > then used a modified version of script which record the time in uSecs > > at the beginning of each line of output. > > used two patches, found 3TiB system will take 100s before slub is ready. > > about three portions: > 1. sparse vmemap buf allocation, it is with bootmem wrapper, so clear those > struct page area take about 30s. > 2. memmap_init_zone: take about 25s > 3. mem_init/free_all_bootmem about 30s. > > so still wonder why 16TiB will need hours. I don't know where you got the figure of hours for memory initialization. That is likely for a 32TiB boot and includes the entire boot, not just getting the memory allocator initialized. 
For a 16 TiB boot:
1) 344
2) 1151
3) 407

I hope that illustrates why we chose to address the memmap_init_zone first
which had the nice side effect of also impacting the free_all_bootmem
slowdown.

With these patches, those numbers are currently:
1) 344
2) 49
3) 26

> also your patches looks like only address 2 and 3.

Right, but I thought that was the normal way to do things. Address one
thing at a time and work toward a better kernel. I don't see a relationship
between the work we are doing here and the sparse vmemmap buffer allocation.
Have I missed something?

Did you happen to time a boot with these patches applied to see how long
it took and how much impact they had on a smaller config?

Robin

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755688Ab3GWITG (ORCPT ); Tue, 23 Jul 2013 04:19:06 -0400
Received: from mail-ea0-f178.google.com ([209.85.215.178]:48855 "EHLO mail-ea0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754236Ab3GWITB (ORCPT ); Tue, 23 Jul 2013 04:19:01 -0400
Date: Tue, 23 Jul 2013 10:18:56 +0200
From: Ingo Molnar
To: Sam Ben
Cc: Borislav Petkov , Robin Holt , Robert Richter , "H.
Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra Subject: Re: boot tracing Message-ID: <20130723081856.GC16088@gmail.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> <20130712084712.GD24008@pd.tnic> <20130712085341.GC4328@gmail.com> <51E3529F.6070909@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51E3529F.6070909@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Sam Ben wrote: > On 07/12/2013 04:53 PM, Ingo Molnar wrote: > >* Borislav Petkov wrote: > > > >>On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote: > >>>Robert Richter and Boris Petkov are working on 'persistent events' > >>>support for perf, which will eventually allow boot time profiling - > >>>I'm not sure if the patches and the tooling support is ready enough > >>>yet for your purposes. > >>Nope, not yet but we're getting there. > >> > >>>Robert, Boris, the following workflow would be pretty intuitive: > >>> > >>> - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB > >>What does perf=boot mean? I assume boot tracing. > >In this case it would mean boot profiling - i.e. a cycles hardware-PMU > >event collecting into a perf trace buffer as usual. > > > >Essentially a 'perf record -a' work-alike, just one that gets activated as > >early as practical, and which would allow the profiling of memory > >initialization. > > > >Now, one extra complication here is that to be able to profile buddy > >allocator this persistent event would have to work before the buddy > >allocator is active :-/ So this sort of profiling would have to use > >memblock_alloc(). > > Could perf=boot be used to sample the performance of memblock subsystem? 
> I think the perf subsystem is too late to be initialized and monitor > this. Yes, that would be a useful facility as well, for things with many events were printk is not necessarily practical. Any tracepoint could be utilized. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755809Ab3GWIUr (ORCPT ); Tue, 23 Jul 2013 04:20:47 -0400 Received: from mail-ea0-f174.google.com ([209.85.215.174]:60837 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755437Ab3GWIUo (ORCPT ); Tue, 23 Jul 2013 04:20:44 -0400 Date: Tue, 23 Jul 2013 10:20:39 +0200 From: Ingo Molnar To: Borislav Petkov Cc: Joonsoo Kim , Robin Holt , Robert Richter , "H. Peter Anvin" , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman , Peter Zijlstra Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130723082039.GD16088@gmail.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <20130712082756.GA4328@gmail.com> <20130716085502.GA31276@lge.com> <20130716090805.GC4402@pd.tnic> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130716090805.GC4402@pd.tnic> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Borislav Petkov wrote: > On Tue, Jul 16, 2013 at 05:55:02PM +0900, Joonsoo Kim wrote: > > How about executing a perf in usermodehelper and collecting output > > in tmpfs? Using this approach, we can start a perf after rootfs > > initialization, > > What for if we can start logging to buffers much earlier? *Reading* > from those buffers can be done much later, at our own leisure with full > userspace up. 
Yeah, agreed, I think this needs to be more integrated into the kernel, so that people don't have to worry about "when does userspace start up the earliest" details. Fundamentally all perf really needs here is some memory to initialize and buffer into. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755706Ab3GWIcT (ORCPT ); Tue, 23 Jul 2013 04:32:19 -0400 Received: from mail-ee0-f45.google.com ([74.125.83.45]:35375 "EHLO mail-ee0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754262Ab3GWIcQ (ORCPT ); Tue, 23 Jul 2013 04:32:16 -0400 Date: Tue, 23 Jul 2013 10:32:11 +0200 From: Ingo Molnar To: "H. Peter Anvin" Cc: Robin Holt , Nathan Zimmer , Yinghai Lu , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130723083211.GE16088@gmail.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> <51E43F91.1040906@zytor.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51E43F91.1040906@zytor.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * H. Peter Anvin wrote: > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > Is there a fairly cheap way to determine definitively that the struct > > page is not initialized? > > By definition I would assume no. The only way I can think of would be > to unmap the memory associated with the struct page in the TLB and > initialize the struct pages at trap time. But ... 
the only fastpath impact I can see of delayed initialization right
now is this piece of logic in prep_new_page():

@@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)

         for (i = 0; i < (1 << order); i++) {
                 struct page *p = page + i;
+
+                if (PageUninitialized2Mib(p))
+                        expand_page_initialization(page);
+
                 if (unlikely(check_new_page(p)))
                         return 1;

That is where I think it can be made zero overhead in the
already-initialized case, because page-flags are already used in
check_new_page():

static inline int check_new_page(struct page *page)
{
        if (unlikely(page_mapcount(page) |
                (page->mapping != NULL) |
                (atomic_read(&page->_count) != 0) |
                (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
                (mem_cgroup_bad_page_check(page)))) {
                bad_page(page);
                return 1;

see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every
struct page on allocation.

We can micro-optimize that low overhead to zero-overhead, by integrating
the PageUninitialized2Mib() check into check_new_page(). This can be done
by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing:

        if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) {
                if (PageUninitialized2Mib(p))
                        expand_page_initialization(page);
                ...
        }

        if (unlikely(page_mapcount(page) |
                (page->mapping != NULL) |
                (atomic_read(&page->_count) != 0) |
                (mem_cgroup_bad_page_check(page)))) {
                bad_page(page);

                return 1;

this will result in making it essentially zero-overhead, the
expand_page_initialization() logic is now in a slowpath.

Am I missing anything here?
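
A rough user-space sketch of the flag-folding idea above, not kernel code:
every name prefixed demo_ or DEMO_ is hypothetical, and it assumes the
uninitialized flag is set on the very page being checked. The rare flag is
added to the mask that is already tested on every allocation, so an
already-initialized page pays only for that one existing test:

```c
#include <assert.h>
#include <stddef.h>

/* Branch hint; falls back to a plain test on compilers without the builtin. */
#if defined(__GNUC__)
#define demo_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define demo_unlikely(x) (x)
#endif

/* Hypothetical flag bits standing in for real page flags. */
#define DEMO_PG_LOCKED  (1UL << 0)  /* a flag that must never be set here */
#define DEMO_PG_UNINIT  (1UL << 1)  /* page contents not yet initialized  */

/* The rare "uninitialized" bit joins the mask that is checked anyway. */
#define DEMO_FLAGS_CHECK_AT_PREP (DEMO_PG_LOCKED | DEMO_PG_UNINIT)

struct demo_page {
        unsigned long flags;
        int count;
        void *mapping;
};

/* Counts slow-path initializations so callers can observe them. */
int demo_init_calls;

/* Slow path: initialize the page and clear its "uninitialized" bit. */
void demo_expand_page_initialization(struct demo_page *page)
{
        page->count = 0;
        page->mapping = NULL;
        page->flags &= ~DEMO_PG_UNINIT;
        demo_init_calls++;
}

/* Returns 0 if the page is usable, 1 if it is in a bad state. */
int demo_check_new_page(struct demo_page *page)
{
        /* One combined test covers both bad flags and late initialization. */
        if (demo_unlikely(page->flags & DEMO_FLAGS_CHECK_AT_PREP)) {
                if (page->flags & DEMO_PG_UNINIT)
                        demo_expand_page_initialization(page);
                /* Any bit still set after initialization is a real error. */
                if (page->flags & DEMO_FLAGS_CHECK_AT_PREP)
                        return 1;
        }

        if (demo_unlikely((page->count != 0) | (page->mapping != NULL)))
                return 1;

        return 0;
}
```

Calling demo_check_new_page() on a page that carries DEMO_PG_UNINIT
initializes it once and returns 0; later calls on the same page never
enter the slow path again.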
Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756922Ab3GWLJv (ORCPT ); Tue, 23 Jul 2013 07:09:51 -0400 Received: from relay3.sgi.com ([192.48.152.1]:59528 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1755084Ab3GWLJu (ORCPT ); Tue, 23 Jul 2013 07:09:50 -0400 Date: Tue, 23 Jul 2013 06:09:47 -0500 From: Robin Holt To: Ingo Molnar Cc: "H. Peter Anvin" , Robin Holt , Nathan Zimmer , Yinghai Lu , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130723110947.GF3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> <51E43F91.1040906@zytor.com> <20130723083211.GE16088@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130723083211.GE16088@gmail.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote: > > * H. Peter Anvin wrote: > > > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > > > Is there a fairly cheap way to determine definitively that the struct > > > page is not initialized? > > > > By definition I would assume no. The only way I can think of would be > > to unmap the memory associated with the struct page in the TLB and > > initialize the struct pages at trap time. > > But ... 
the only fastpath impact I can see of delayed initialization right
> now is this piece of logic in prep_new_page():
>
> @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
>
>          for (i = 0; i < (1 << order); i++) {
>                  struct page *p = page + i;
> +
> +                if (PageUninitialized2Mib(p))
> +                        expand_page_initialization(page);
> +
>                  if (unlikely(check_new_page(p)))
>                          return 1;
>
> That is where I think it can be made zero overhead in the
> already-initialized case, because page-flags are already used in
> check_new_page():

The problem I see here is that the uninitialized flag we need to check is
in the "other" page, the one aligned at the 2MiB virtual address, not in
the page currently being referenced.

Let me try a version of the patch where we set the PG_uninitialized_2m
flag on all pages, including the aligned pages, and see what that does
to performance.

Robin

>
> static inline int check_new_page(struct page *page)
> {
>         if (unlikely(page_mapcount(page) |
>                 (page->mapping != NULL) |
>                 (atomic_read(&page->_count) != 0) |
>                 (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
>                 (mem_cgroup_bad_page_check(page)))) {
>                 bad_page(page);
>                 return 1;
>
> see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every
> struct page on allocation.
>
> We can micro-optimize that low overhead to zero-overhead, by integrating
> the PageUninitialized2Mib() check into check_new_page(). This can be done
> by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing:
>
>
>         if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) {
>                 if (PageUninitialized2Mib(p))
>                         expand_page_initialization(page);
>                 ...
>         }
>
>         if (unlikely(page_mapcount(page) |
>                 (page->mapping != NULL) |
>                 (atomic_read(&page->_count) != 0) |
>                 (mem_cgroup_bad_page_check(page)))) {
>                 bad_page(page);
>
>                 return 1;
>
> this will result in making it essentially zero-overhead, the
> expand_page_initialization() logic is now in a slowpath.
> > Am I missing anything here? > > Thanks, > > Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756988Ab3GWLPv (ORCPT ); Tue, 23 Jul 2013 07:15:51 -0400 Received: from relay3.sgi.com ([192.48.152.1]:49128 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1755827Ab3GWLPu (ORCPT ); Tue, 23 Jul 2013 07:15:50 -0400 Date: Tue, 23 Jul 2013 06:15:49 -0500 From: Robin Holt To: Ingo Molnar Cc: "H. Peter Anvin" , Robin Holt , Nathan Zimmer , Yinghai Lu , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130723111549.GG3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> <51E43F91.1040906@zytor.com> <20130723083211.GE16088@gmail.com> <20130723110947.GF3421@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130723110947.GF3421@sgi.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I think the other critical path which is affected is in expand(). There, we just call ensure_page_is_initialized() blindly which does the check against the other page. The below is a nearly zero addition. Sorry for the confusion. My morning coffee has not kicked in yet. Robin On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote: > On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote: > > > > * H. Peter Anvin wrote: > > > > > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > > > > > Is there a fairly cheap way to determine definitively that the struct > > > > page is not initialized? 
> > > > > > By definition I would assume no. The only way I can think of would be > > > to unmap the memory associated with the struct page in the TLB and > > > initialize the struct pages at trap time. > > > > But ... the only fastpath impact I can see of delayed initialization right > > now is this piece of logic in prep_new_page(): > > > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > for (i = 0; i < (1 << order); i++) { > > struct page *p = page + i; > > + > > + if (PageUninitialized2Mib(p)) > > + expand_page_initialization(page); > > + > > if (unlikely(check_new_page(p))) > > return 1; > > > > That is where I think it can be made zero overhead in the > > already-initialized case, because page-flags are already used in > > check_new_page(): > > The problem I see here is that the page flags we need to check for the > uninitialized flag are in the "other" page for the page aligned at the > 2MiB virtual address, not the page currently being referenced. > > Let me try a version of the patch where we set the PG_unintialized_2m > flag on all pages, including the aligned pages and see what that does > to performance. > > Robin > > > > > static inline int check_new_page(struct page *page) > > { > > if (unlikely(page_mapcount(page) | > > (page->mapping != NULL) | > > (atomic_read(&page->_count) != 0) | > > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > > (mem_cgroup_bad_page_check(page)))) { > > bad_page(page); > > return 1; > > > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every > > struct page on allocation. > > > > We can micro-optimize that low overhead to zero-overhead, by integrating > > the PageUninitialized2Mib() check into check_new_page(). This can be done > > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: > > > > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { > > if (PageUninitialized2Mib(p)) > > expand_page_initialization(page); > > ... 
> > } > > > > if (unlikely(page_mapcount(page) | > > (page->mapping != NULL) | > > (atomic_read(&page->_count) != 0) | > > (mem_cgroup_bad_page_check(page)))) { > > bad_page(page); > > > > return 1; > > > > this will result in making it essentially zero-overhead, the > > expand_page_initialization() logic is now in a slowpath. > > > > Am I missing anything here? > > > > Thanks, > > > > Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756947Ab3GWLly (ORCPT ); Tue, 23 Jul 2013 07:41:54 -0400 Received: from relay1.sgi.com ([192.48.179.29]:48641 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756332Ab3GWLlw (ORCPT ); Tue, 23 Jul 2013 07:41:52 -0400 Date: Tue, 23 Jul 2013 06:41:50 -0500 From: Robin Holt To: Ingo Molnar Cc: "H. Peter Anvin" , Robin Holt , Nathan Zimmer , Yinghai Lu , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130723114150.GH3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> <51E43F91.1040906@zytor.com> <20130723083211.GE16088@gmail.com> <20130723110947.GF3421@sgi.com> <20130723111549.GG3421@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130723111549.GG3421@sgi.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 23, 2013 at 06:15:49AM -0500, Robin Holt wrote: > I think the other critical path which is affected is in expand(). > There, we just call ensure_page_is_initialized() blindly which does > the check against the other page. 
The below is a nearly zero addition.
> Sorry for the confusion. My morning coffee has not kicked in yet.

I don't have access to the 16TiB system until Thursday unless the other
testing on it fails early. I did boot a 2TiB system with a change
which set the Uninitialized_2m flag on all pages in that 2MiB range
during memmap_init_zone. That makes the expand check test against
the referenced page instead of having to go back to the 2MiB page.
It appears to have added less than a second to the 2TiB boot so I hope
it has equally little impact on the 16TiB boot.

I will clean up this patch some more and resend the currently untested
set later today.

Thanks,
Robin

>
> Robin
>
> On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote:
> > On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote:
> > >
> > > * H. Peter Anvin wrote:
> > >
> > > > On 07/15/2013 11:26 AM, Robin Holt wrote:
> > > >
> > > > > Is there a fairly cheap way to determine definitively that the struct
> > > > > page is not initialized?
> > > >
> > > > By definition I would assume no. The only way I can think of would be
> > > > to unmap the memory associated with the struct page in the TLB and
> > > > initialize the struct pages at trap time.
> > >
> > > But ...
the only fastpath impact I can see of delayed initialization right > > > now is this piece of logic in prep_new_page(): > > > > > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > > > for (i = 0; i < (1 << order); i++) { > > > struct page *p = page + i; > > > + > > > + if (PageUninitialized2Mib(p)) > > > + expand_page_initialization(page); > > > + > > > if (unlikely(check_new_page(p))) > > > return 1; > > > > > > That is where I think it can be made zero overhead in the > > > already-initialized case, because page-flags are already used in > > > check_new_page(): > > > > The problem I see here is that the page flags we need to check for the > > uninitialized flag are in the "other" page for the page aligned at the > > 2MiB virtual address, not the page currently being referenced. > > > > Let me try a version of the patch where we set the PG_unintialized_2m > > flag on all pages, including the aligned pages and see what that does > > to performance. > > > > Robin > > > > > > > > static inline int check_new_page(struct page *page) > > > { > > > if (unlikely(page_mapcount(page) | > > > (page->mapping != NULL) | > > > (atomic_read(&page->_count) != 0) | > > > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > > > (mem_cgroup_bad_page_check(page)))) { > > > bad_page(page); > > > return 1; > > > > > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every > > > struct page on allocation. > > > > > > We can micro-optimize that low overhead to zero-overhead, by integrating > > > the PageUninitialized2Mib() check into check_new_page(). This can be done > > > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: > > > > > > > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { > > > if (PageUninitialized2Mib(p)) > > > expand_page_initialization(page); > > > ... 
> > > } > > > > > > if (unlikely(page_mapcount(page) | > > > (page->mapping != NULL) | > > > (atomic_read(&page->_count) != 0) | > > > (mem_cgroup_bad_page_check(page)))) { > > > bad_page(page); > > > > > > return 1; > > > > > > this will result in making it essentially zero-overhead, the > > > expand_page_initialization() logic is now in a slowpath. > > > > > > Am I missing anything here? > > > > > > Thanks, > > > > > > Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756845Ab3GWLuZ (ORCPT ); Tue, 23 Jul 2013 07:50:25 -0400 Received: from relay2.sgi.com ([192.48.179.30]:48658 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1756129Ab3GWLuX (ORCPT ); Tue, 23 Jul 2013 07:50:23 -0400 Date: Tue, 23 Jul 2013 06:50:21 -0500 From: Robin Holt To: Ingo Molnar Cc: "H. Peter Anvin" , Robin Holt , Nathan Zimmer , Yinghai Lu , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Subject: Re: [RFC 4/4] Sparse initialization of struct page array. Message-ID: <20130723115021.GI3421@sgi.com> References: <1373594635-131067-5-git-send-email-holt@sgi.com> <20130715174551.GA58640@asylum.americas.sgi.com> <51E4375E.1010704@zytor.com> <20130715182615.GF3421@sgi.com> <51E43F91.1040906@zytor.com> <20130723083211.GE16088@gmail.com> <20130723110947.GF3421@sgi.com> <20130723111549.GG3421@sgi.com> <20130723114150.GH3421@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130723114150.GH3421@sgi.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 23, 2013 at 06:41:50AM -0500, Robin Holt wrote: > On Tue, Jul 23, 2013 at 06:15:49AM -0500, Robin Holt wrote: > > I think the other critical path which is affected is in expand(). 
> > There, we just call ensure_page_is_initialized() blindly which does > > the check against the other page. The below is a nearly zero addition. > > Sorry for the confusion. My morning coffee has not kicked in yet. > > I don't have access to the 16TiB system until Thursday unless the other > testing on it fails early. I did boot a 2TiB system with the a change > which set the Unitialized_2m flag on all pages in that 2MiB range > during memmap_init_zone. That makes the expand check test against > the referenced page instead of having to go back to the 2MiB page. > It appears to have added less than a second to the 2TiB boot so I hope > it has equally little impact to the 16TiB boot. I was wrong. One of the two logs I looked at was the wrong one. Setting that Unitialized2m flag on all pages added 30 seconds to the 2TiB boot's memmap_init_zone(). Please disregard. That brings me back to the belief we need a better solution for the expand() path. Robin > > I will clean up this patch some more and resend the currently untested > set later today. > > Thanks, > Robin > > > > > Robin > > > > On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote: > > > On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote: > > > > > > > > * H. Peter Anvin wrote: > > > > > > > > > On 07/15/2013 11:26 AM, Robin Holt wrote: > > > > > > > > > > > Is there a fairly cheap way to determine definitively that the struct > > > > > > page is not initialized? > > > > > > > > > > By definition I would assume no. The only way I can think of would be > > > > > to unmap the memory associated with the struct page in the TLB and > > > > > initialize the struct pages at trap time. > > > > > > > > But ... 
the only fastpath impact I can see of delayed initialization right > > > > now is this piece of logic in prep_new_page(): > > > > > > > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > > > > > for (i = 0; i < (1 << order); i++) { > > > > struct page *p = page + i; > > > > + > > > > + if (PageUninitialized2Mib(p)) > > > > + expand_page_initialization(page); > > > > + > > > > if (unlikely(check_new_page(p))) > > > > return 1; > > > > > > > > That is where I think it can be made zero overhead in the > > > > already-initialized case, because page-flags are already used in > > > > check_new_page(): > > > > > > The problem I see here is that the page flags we need to check for the > > > uninitialized flag are in the "other" page for the page aligned at the > > > 2MiB virtual address, not the page currently being referenced. > > > > > > Let me try a version of the patch where we set the PG_unintialized_2m > > > flag on all pages, including the aligned pages and see what that does > > > to performance. > > > > > > Robin > > > > > > > > > > > static inline int check_new_page(struct page *page) > > > > { > > > > if (unlikely(page_mapcount(page) | > > > > (page->mapping != NULL) | > > > > (atomic_read(&page->_count) != 0) | > > > > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) | > > > > (mem_cgroup_bad_page_check(page)))) { > > > > bad_page(page); > > > > return 1; > > > > > > > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every > > > > struct page on allocation. > > > > > > > > We can micro-optimize that low overhead to zero-overhead, by integrating > > > > the PageUninitialized2Mib() check into check_new_page(). This can be done > > > > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing: > > > > > > > > > > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) { > > > > if (PageUninitialized2Mib(p)) > > > > expand_page_initialization(page); > > > > ... 
> > > > } > > > > > > > > if (unlikely(page_mapcount(page) | > > > > (page->mapping != NULL) | > > > > (atomic_read(&page->_count) != 0) | > > > > (mem_cgroup_bad_page_check(page)))) { > > > > bad_page(page); > > > > > > > > return 1; > > > > > > > > this will result in making it essentially zero-overhead, the > > > > expand_page_initialization() logic is now in a slowpath. > > > > > > > > Am I missing anything here? > > > > > > > > Thanks, > > > > > > > > Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757964Ab3GWPdO (ORCPT ); Tue, 23 Jul 2013 11:33:14 -0400 Received: from zene.cmpxchg.org ([85.214.230.12]:50793 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757798Ab3GWPdM (ORCPT ); Tue, 23 Jul 2013 11:33:12 -0400 Date: Tue, 23 Jul 2013 11:32:57 -0400 From: Johannes Weiner To: Sam Ben Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Yinghai Lu , Mel Gorman Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks. Message-ID: <20130723153257.GK715@cmpxchg.org> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-3-git-send-email-holt@sgi.com> <51E5447D.70901@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51E5447D.70901@gmail.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 16, 2013 at 09:02:53PM +0800, Sam Ben wrote: > Hi Robin, > On 07/12/2013 10:03 AM, Robin Holt wrote: > >Currently, when free_all_bootmem() calls __free_pages_memory(), the > >number of contiguous pages that __free_pages_memory() passes to the > >buddy allocator is limited to BITS_PER_LONG. In order to be able to > > I fail to understand this. Why the original page number is BITS_PER_LONG? 
The mm/bootmem.c implementation uses a bitmap to keep track of
free/reserved pages.  It walks that bitmap in BITS_PER_LONG steps
because it is the biggest chunk that is still trivial and cheap to
check if all pages are free in it (chunk == ~0UL).

nobootmem.c was written based on the bootmem.c interface, so it was
probably adapted to keep things similar between the two, short of a
pressing reason not to.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754519Ab3GYCZu (ORCPT ); Wed, 24 Jul 2013 22:25:50 -0400
Received: from relay3.sgi.com ([192.48.152.1]:37530 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1753215Ab3GYCZq (ORCPT ); Wed, 24 Jul 2013 22:25:46 -0400
Date: Wed, 24 Jul 2013 21:25:43 -0500
From: Robin Holt
To: Yinghai Lu
Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.
Message-ID: <20130725022543.GR3421@sgi.com>
References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote:
> On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt wrote:
> > During boot of large memory machines, a significant portion of boot
> > is spent initializing the struct page array.  The vast majority of
> > those pages are not referenced during boot.
> >
> > Change this over to only initializing the pages when they are
> > actually allocated.
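
As a rough illustration of the approach in the changelog quoted above,
here is a minimal user-space sketch, not the kernel code.  It assumes
512 4KiB pages per 2MiB chunk, and every name in it (sketch_page,
sketch_boot_init, sketch_ensure_initialized, and the uninit_chunk field
standing in for PG_uninitialized2mib) is made up for the example.  Boot
touches only the first page of each chunk; the rest are filled in on
first use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4KiB pages, so one 2MiB chunk covers 512 page structs. */
#define SKETCH_CHUNK_PAGES ((size_t)512)

/* Hypothetical stand-in for struct page; not the kernel layout. */
struct sketch_page {
	bool initialized;   /* has this entry been set up at all? */
	bool uninit_chunk;  /* on a chunk head: rest of chunk still untouched */
	int nid;            /* node id recorded at initialization time */
};

/* Counts single-page initializations so the saving is observable. */
size_t sketch_init_calls;

void sketch_init_single(struct sketch_page *p, int nid)
{
	p->initialized = true;
	p->uninit_chunk = false;
	p->nid = nid;
	sketch_init_calls++;
}

/* Boot-time pass: initialize only the head page of each chunk, flag it. */
void sketch_boot_init(struct sketch_page *map, size_t npages, int nid)
{
	assert(npages % SKETCH_CHUNK_PAGES == 0);
	for (size_t pfn = 0; pfn < npages; pfn += SKETCH_CHUNK_PAGES) {
		sketch_init_single(&map[pfn], nid);
		map[pfn].uninit_chunk = true;
	}
}

/* First-use pass: fill in the remaining pages of the chunk holding pfn. */
void sketch_ensure_initialized(struct sketch_page *map, size_t pfn)
{
	size_t head = pfn & ~(SKETCH_CHUNK_PAGES - 1);

	if (!map[head].uninit_chunk)
		return;                 /* chunk already expanded */

	map[head].uninit_chunk = false;
	for (size_t i = head + 1; i < head + SKETCH_CHUNK_PAGES; i++)
		sketch_init_single(&map[i], map[head].nid);
}
```

With four chunks, sketch_boot_init() performs 4 initializations instead
of 2048.  The first sketch_ensure_initialized() call on any page of a
chunk adds the remaining 511 for that chunk, and later calls on the
same chunk do nothing.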
> > > > Besides the advantage of boot speed, this allows us the chance to > > use normal performance monitoring tools to determine where the bulk > > of time is spent during page initialization. > > > > Signed-off-by: Robin Holt > > Signed-off-by: Nate Zimmer > > To: "H. Peter Anvin" > > To: Ingo Molnar > > Cc: Linux Kernel > > Cc: Linux MM > > Cc: Rob Landley > > Cc: Mike Travis > > Cc: Daniel J Blueman > > Cc: Andrew Morton > > Cc: Greg KH > > Cc: Yinghai Lu > > Cc: Mel Gorman > > --- > > include/linux/mm.h | 11 +++++ > > include/linux/page-flags.h | 5 +- > > mm/nobootmem.c | 5 ++ > > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++-- > > 4 files changed, 132 insertions(+), 6 deletions(-) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index e0c8528..3de08b5 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page) > > __free_page(page); > > } > > > > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > + > > +static inline void __reserve_bootmem_page(struct page *page) > > +{ > > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT; > > + phys_addr_t end = start + PAGE_SIZE; > > + > > + __reserve_bootmem_region(start, end); > > +} > > + > > static inline void free_reserved_page(struct page *page) > > { > > + __reserve_bootmem_page(page); > > __free_reserved_page(page); > > adjust_managed_page_count(page, 1); > > } > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 6d53675..79e8eb7 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -83,6 +83,7 @@ enum pageflags { > > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ > > PG_arch_1, > > PG_reserved, > > + PG_uninitialized2mib, /* Is this the right spot? 
ntz - Yes - rmh */ > > PG_private, /* If pagecache, has fs-private data */ > > PG_private_2, /* If pagecache, has fs aux data */ > > PG_writeback, /* Page is under writeback */ > > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) > > > > __PAGEFLAG(SlobFree, slob_free) > > > > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib) > > + > > /* > > * Private page markings that may be used by the filesystem that owns the page > > * for its own purposes. > > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) > > #define PAGE_FLAGS_CHECK_AT_FREE \ > > (1 << PG_lru | 1 << PG_locked | \ > > 1 << PG_private | 1 << PG_private_2 | \ > > - 1 << PG_writeback | 1 << PG_reserved | \ > > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \ > > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ > > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ > > __PG_COMPOUND_LOCK) > > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > > index 3b512ca..e3a386d 100644 > > --- a/mm/nobootmem.c > > +++ b/mm/nobootmem.c > > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void) > > phys_addr_t start, end, size; > > u64 i; > > > > + for_each_reserved_mem_region(i, &start, &end) > > + __reserve_bootmem_region(start, end); > > + > > How about holes that is not in memblock.reserved? > > before this patch: > free_area_init_node/free_area_init_core/memmap_init_zone > will mark all page in node range to Reserved in struct page, at first. > > but those holes is not mapped via kernel low mapping. > so it should be ok not touch "struct page" for them. > > Now you only mark reserved for memblock.reserved at first, and later > mark {memblock.memory} - { memblock.reserved} to be available. > And that is ok. > > but should split that change to another patch and add some comment > and change log for the change. > in case there is some user like UEFI etc do weird thing. 
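
One way to picture the ordering described in the comment above is the
small user-space model below.  It is only a sketch under assumptions:
ranges are half-open page-frame intervals, and the names (sketch_state,
sketch_range, sketch_mark_reserved, sketch_free_unreserved) are
invented here and are not memblock API.  Reserved ranges are marked
first, then everything in memory minus reserved is released.  Page
frames in holes between memory ranges are never touched and keep their
zero state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page state; SKETCH_UNTOUCHED models an all-zero entry. */
enum sketch_state { SKETCH_UNTOUCHED = 0, SKETCH_RESERVED, SKETCH_FREE };

/* Half-open page-frame range [start, end). */
struct sketch_range { size_t start, end; };

/* Step 1: mark every page frame covered by a reserved range. */
void sketch_mark_reserved(enum sketch_state *pages, size_t npages,
			  const struct sketch_range *rsv, size_t nr_rsv)
{
	for (size_t i = 0; i < nr_rsv; i++) {
		assert(rsv[i].start <= rsv[i].end && rsv[i].end <= npages);
		for (size_t pfn = rsv[i].start; pfn < rsv[i].end; pfn++)
			pages[pfn] = SKETCH_RESERVED;
	}
}

/*
 * Step 2: release {memory} - {reserved}.  Returns the number of page
 * frames released.  Frames outside every memory range (holes) are
 * skipped entirely.
 */
size_t sketch_free_unreserved(enum sketch_state *pages, size_t npages,
			      const struct sketch_range *mem, size_t nr_mem)
{
	size_t freed = 0;

	for (size_t i = 0; i < nr_mem; i++) {
		assert(mem[i].start <= mem[i].end && mem[i].end <= npages);
		for (size_t pfn = mem[i].start; pfn < mem[i].end; pfn++) {
			if (pages[pfn] == SKETCH_RESERVED)
				continue;       /* stays reserved */
			pages[pfn] = SKETCH_FREE;
			freed++;
		}
	}
	return freed;
}
```

For example, with memory ranges [0, 8) and [12, 20) and reserved ranges
[2, 4) and [14, 15), 13 of the 16 memory page frames are released, and
frames 8 through 11 stay untouched because they fall in the hole.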
Nate and I talked this over today. Sorry for the delay, but it was the first time we were both free. Neither of us quite understands what you are asking for here. My interpretation is that you would like us to change the use of the PageReserved flag such that during boot, we do not set the flag at all from memmap_init_zone, and then only set it on pages which, at the time of free_all_bootmem, have been allocated/reserved in the memblock allocator. Is that correct? I will start to work that up on the assumption that is what you are asking for. Robin > > > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > > count += __free_memory_core(start, end); > > > > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void) > > { > > struct pglist_data *pgdat; > > > > + memblock_dump_all(); > > + > > Not needed. > > > for_each_online_pgdat(pgdat) > > reset_node_lowmem_managed_pages(pgdat); > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 635b131..fe51eb9 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i > > #endif > > } > > > > +static void expand_page_initialization(struct page *basepage) > > +{ > > + unsigned long pfn = page_to_pfn(basepage); > > + unsigned long end_pfn = pfn + PTRS_PER_PMD; > > + unsigned long zone = page_zonenum(basepage); > > + int reserved = PageReserved(basepage); > > + int nid = page_to_nid(basepage); > > + > > + ClearPageUninitialized2Mib(basepage); > > + > > + for( pfn++; pfn < end_pfn; pfn++ ) > > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved); > > +} > > + > > +void ensure_pages_are_initialized(unsigned long start_pfn, > > + unsigned long end_pfn) > > +{ > > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); > > + unsigned long aligned_end_pfn; > > + struct page *page; > > + > > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); > > + aligned_end_pfn += PTRS_PER_PMD; > > + 
while (aligned_start_pfn < aligned_end_pfn) { > > + if (pfn_valid(aligned_start_pfn)) { > > + page = pfn_to_page(aligned_start_pfn); > > + > > + if(PageUninitialized2Mib(page)) > > + expand_page_initialization(page); > > + } > > + > > + aligned_start_pfn += PTRS_PER_PMD; > > + } > > +} > > + > > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > > +{ > > + unsigned long start_pfn = PFN_DOWN(start); > > + unsigned long end_pfn = PFN_UP(end); > > + > > + ensure_pages_are_initialized(start_pfn, end_pfn); > > +} > > that name is confusing, actually it is setting to struct page to Reserved only. > maybe __reserve_pages_bootmem() to be aligned to free_pages_bootmem ? > > > + > > +static inline void ensure_page_is_initialized(struct page *page) > > +{ > > + __reserve_bootmem_page(page); > > +} > > how about use __reserve_page_bootmem directly and add comment in callers site ? > > > + > > static bool free_pages_prepare(struct page *page, unsigned int order) > > { > > int i; > > @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) > > if (PageAnon(page)) > > page->mapping = NULL; > > for (i = 0; i < (1 << order); i++) > > - bad += free_pages_check(page + i); > > + if (PageUninitialized2Mib(page + i)) > > + i += PTRS_PER_PMD - 1; > > + else > > + bad += free_pages_check(page + i); > > if (bad) > > return false; > > > > @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) > > unsigned int loop; > > > > prefetchw(page); > > - for (loop = 0; loop < nr_pages; loop++) { > > + for (loop = 0; loop < nr_pages; ) { > > struct page *p = &page[loop]; > > > > if (loop + 1 < nr_pages) > > prefetchw(p + 1); > > + > > + if ((PageUninitialized2Mib(p)) && > > + ((loop + PTRS_PER_PMD) > nr_pages)) > > + ensure_page_is_initialized(p); > > + > > __ClearPageReserved(p); > > set_page_count(p, 0); > > + if (PageUninitialized2Mib(p)) > > + loop += PTRS_PER_PMD; > > + else > > + loop += 1; > > } > > > 
> page_zone(page)->managed_pages += 1 << order; > > @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page, > > area--; > > high--; > > size >>= 1; > > + ensure_page_is_initialized(page); > > VM_BUG_ON(bad_range(zone, &page[size])); > > > > #ifdef CONFIG_DEBUG_PAGEALLOC > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) > > > > for (i = 0; i < (1 << order); i++) { > > struct page *p = page + i; > > + > > + if (PageUninitialized2Mib(p)) > > + expand_page_initialization(page); > > + > > if (unlikely(check_new_page(p))) > > return 1; > > } > > @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone, > > unsigned long order; > > int pages_moved = 0; > > > > + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page)); > > #ifndef CONFIG_HOLES_IN_ZONE > > /* > > * page_zone is not safe to call in this context when > > @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) > > for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) > > return 1; > > + > > + if (PageUninitialized2Mib(pfn_to_page(pfn))) > > + pfn += PTRS_PER_PMD; > > } > > return 0; > > } > > @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone) > > } > > } > > > > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn, > > + unsigned long size, int nid) > why not use static ? it seems there is not outside user. 
> > +{ > > + unsigned long validate_end_pfn = pfn + size; > > + > > + if (pfn & (size - 1)) > > + return 1; > > + > > + if (pfn + size >= end_pfn) > > + return 1; > > + > > + while (pfn < validate_end_pfn) > > + { > > + if (!early_pfn_valid(pfn)) > > + return 1; > > + if (!early_pfn_in_nid(pfn, nid)) > > + return 1; > > + pfn++; > > + } > > + > > + return size; > > +} > > + > > /* > > * Initially all pages are reserved - free ones are freed > > * up by free_all_bootmem() once the early boot process is > > @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > > highest_memmap_pfn = end_pfn - 1; > > > > z = &NODE_DATA(nid)->node_zones[zone]; > > - for (pfn = start_pfn; pfn < end_pfn; pfn++) { > > + for (pfn = start_pfn; pfn < end_pfn; ) { > > /* > > * There can be holes in boot-time mem_map[]s > > * handed to this function. They do not > > * exist on hotplugged memory. > > */ > > + int pfns = 1; > > if (context == MEMMAP_EARLY) { > > - if (!early_pfn_valid(pfn)) > > + if (!early_pfn_valid(pfn)) { > > + pfn++; > > continue; > > - if (!early_pfn_in_nid(pfn, nid)) > > + } > > + if (!early_pfn_in_nid(pfn, nid)) { > > + pfn++; > > continue; > > + } > > + > > + pfns = pfn_range_init_avail(pfn, end_pfn, > > + PTRS_PER_PMD, nid); > > } > > maybe could update memmap_init_zone() only iterate {memblock.memory} - > {memblock.reserved}, so you do need to check avail .... > > as memmap_init_zone do not need to handle holes to mark reserve for them. > > > + > > page = pfn_to_page(pfn); > > __init_single_page(page, zone, nid, 1); > > + > > + if (pfns > 1) > > + SetPageUninitialized2Mib(page); > > + > > + pfn += pfns; > > } > > } > > > > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = { > > {1UL << PG_owner_priv_1, "owner_priv_1" }, > > {1UL << PG_arch_1, "arch_1" }, > > {1UL << PG_reserved, "reserved" }, > > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" }, > > PG_uninitialized_2m ? 
> > > {1UL << PG_private, "private" }, > > {1UL << PG_private_2, "private_2" }, > > {1UL << PG_writeback, "writeback" }, > > Yinghai From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755955Ab3GYMu7 (ORCPT ); Thu, 25 Jul 2013 08:50:59 -0400 Received: from mail-oa0-f50.google.com ([209.85.219.50]:59849 "EHLO mail-oa0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755666Ab3GYMu5 (ORCPT ); Thu, 25 Jul 2013 08:50:57 -0400 MIME-Version: 1.0 In-Reply-To: <20130725022543.GR3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130725022543.GR3421@sgi.com> Date: Thu, 25 Jul 2013 05:50:57 -0700 X-Google-Sender-Auth: aeh1FTLMdnvnSTRlUHsahvQw9Yw Message-ID: Subject: Re: [RFC 4/4] Sparse initialization of struct page array. From: Yinghai Lu To: Robin Holt Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt wrote: >> >> How about holes that is not in memblock.reserved? >> >> before this patch: >> free_area_init_node/free_area_init_core/memmap_init_zone >> will mark all page in node range to Reserved in struct page, at first. >> >> but those holes is not mapped via kernel low mapping. >> so it should be ok not touch "struct page" for them. >> >> Now you only mark reserved for memblock.reserved at first, and later >> mark {memblock.memory} - { memblock.reserved} to be available. >> And that is ok. >> >> but should split that change to another patch and add some comment >> and change log for the change. >> in case there is some user like UEFI etc do weird thing. > > Nate and I talked this over today. 
Sorry for the delay, but it was the
> first time we were both free.  Neither of us quite understands what you
> are asking for here.  My interpretation is that you would like us to
> change the use of the PageReserved flag such that during boot, we do not
> set the flag at all from memmap_init_zone, and then only set it on pages
> which, at the time of free_all_bootmem, have been allocated/reserved in
> the memblock allocator.  Is that correct?  I will start to work that up
> on the assumption that is what you are asking for.

Not exactly.

your change should be right, but there is some subtle change about
holes handling.

before mem holes between memory ranges in memblock.memory, get struct page,
and initialized with to Reserved in memmap_init_zone.
Those holes is not in memory.reserved, with your patches, those hole's
struct page will still have all 0.

Please separate change about set page to reserved according to memory.reserved
to another patch.

Thanks

Yinghai

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756104Ab3GYNmb (ORCPT ); Thu, 25 Jul 2013 09:42:31 -0400
Received: from relay2.sgi.com ([192.48.179.30]:51509 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1755417Ab3GYNm2 (ORCPT ); Thu, 25 Jul 2013 09:42:28 -0400
Date: Thu, 25 Jul 2013 08:42:27 -0500
From: Robin Holt
To: Yinghai Lu
Cc: Robin Holt , "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.
Message-ID: <20130725134227.GT3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130725022543.GR3421@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jul 25, 2013 at 05:50:57AM -0700, Yinghai Lu wrote: > On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt wrote: > >> > >> How about holes that is not in memblock.reserved? > >> > >> before this patch: > >> free_area_init_node/free_area_init_core/memmap_init_zone > >> will mark all page in node range to Reserved in struct page, at first. > >> > >> but those holes is not mapped via kernel low mapping. > >> so it should be ok not touch "struct page" for them. > >> > >> Now you only mark reserved for memblock.reserved at first, and later > >> mark {memblock.memory} - { memblock.reserved} to be available. > >> And that is ok. > >> > >> but should split that change to another patch and add some comment > >> and change log for the change. > >> in case there is some user like UEFI etc do weird thing. > > > > Nate and I talked this over today. Sorry for the delay, but it was the > > first time we were both free. Neither of us quite understands what you > > are asking for here. My interpretation is that you would like us to > > change the use of the PageReserved flag such that during boot, we do not > > set the flag at all from memmap_init_zone, and then only set it on pages > > which, at the time of free_all_bootmem, have been allocated/reserved in > > the memblock allocator. Is that correct? I will start to work that up > > on the assumption that is what you are asking for. > > Not exactly. > > your change should be right, but there is some subtle change about > holes handling. 
> > before mem holes between memory ranges in memblock.memory, get struct page, > and initialized with to Reserved in memmap_init_zone. > Those holes is not in memory.reserved, with your patches, those hole's > struct page > will still have all 0. > > Please separate change about set page to reserved according to memory.reserved > to another patch. Just want to make sure this is where you want me to go. Here is my currently untested patch. Is that what you were expecting to have done? One thing I don't like about this patch is it seems to slow down boot in my simulator environment. I think I am going to look at restructuring things a bit to see if I can eliminate that performance penalty. Otherwise, I think I am following your directions. Thanks, Robin Holt >>From bdd2fefa59af18f283af6f066bc644ddfa5c7da8 Mon Sep 17 00:00:00 2001 From: Robin Holt Date: Thu, 25 Jul 2013 04:25:15 -0500 Subject: [RFC -v2-pre2 4/5] ZZZ Only SegPageReserved() on memblock reserved pages. --- include/linux/mm.h | 2 ++ mm/nobootmem.c | 3 +++ mm/page_alloc.c | 18 +++++++++++------- 3 files changed, 16 insertions(+), 7 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e0c8528..b264a26 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) totalram_pages += count; } +extern void reserve_bootmem_region(unsigned long start, unsigned long end); + /* Free the reserved page into the buddy system, so it gets managed. 
*/ static inline void __free_reserved_page(struct page *page) { diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 2159e68..0840af2 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) phys_addr_t start, end, size; u64 i; + for_each_reserved_mem_region(i, &start, &end) + reserve_bootmem_region(start, end); + for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) count += __free_memory_core(start, end); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 048e166..3aa30b7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -698,7 +698,7 @@ static void free_one_page(struct zone *zone, struct page *page, int order, } static void __init_single_page(unsigned long pfn, unsigned long zone, - int nid, int reserved) + int nid, int page_count) { struct page *page = pfn_to_page(pfn); struct zone *z = &NODE_DATA(nid)->node_zones[zone]; @@ -708,12 +708,9 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, init_page_count(page); page_mapcount_reset(page); page_nid_reset_last(page); - if (reserved) { - SetPageReserved(page); - } else { - ClearPageReserved(page); - set_page_count(page, 0); - } + ClearPageReserved(page); + set_page_count(page, page_count); + /* * Mark the block movable so that blocks are reserved for * movable at startup. 
This will force kernel allocations @@ -741,6 +738,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, #endif } +void reserve_bootmem_region(unsigned long start, unsigned long end) +{ + for (; start < end; start++) + if (pfn_valid(start)) + SetPageReserved(pfn_to_page(start)); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756045Ab3GYNwU (ORCPT ); Thu, 25 Jul 2013 09:52:20 -0400 Received: from mail-ob0-f171.google.com ([209.85.214.171]:36227 "EHLO mail-ob0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755882Ab3GYNwQ (ORCPT ); Thu, 25 Jul 2013 09:52:16 -0400 MIME-Version: 1.0 In-Reply-To: <20130725134227.GT3421@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1373594635-131067-5-git-send-email-holt@sgi.com> <20130725022543.GR3421@sgi.com> <20130725134227.GT3421@sgi.com> Date: Thu, 25 Jul 2013 06:52:15 -0700 X-Google-Sender-Auth: wod5VkvBGZqQik8bN-EHGy2apic Message-ID: Subject: Re: [RFC 4/4] Sparse initialization of struct page array. From: Yinghai Lu To: Robin Holt Cc: "H. Peter Anvin" , Ingo Molnar , Nate Zimmer , Linux Kernel , Linux MM , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg KH , Mel Gorman Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jul 25, 2013 at 6:42 AM, Robin Holt wrote: > On Thu, Jul 25, 2013 at 05:50:57AM -0700, Yinghai Lu wrote: >> On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt wrote: >> >> >> >> How about holes that is not in memblock.reserved? >> >> >> >> before this patch: >> >> free_area_init_node/free_area_init_core/memmap_init_zone >> >> will mark all page in node range to Reserved in struct page, at first. 
>> >> >> >> but those holes is not mapped via kernel low mapping. >> >> so it should be ok not touch "struct page" for them. >> >> >> >> Now you only mark reserved for memblock.reserved at first, and later >> >> mark {memblock.memory} - { memblock.reserved} to be available. >> >> And that is ok. >> >> >> >> but should split that change to another patch and add some comment >> >> and change log for the change. >> >> in case there is some user like UEFI etc do weird thing. >> > >> > Nate and I talked this over today. Sorry for the delay, but it was the >> > first time we were both free. Neither of us quite understands what you >> > are asking for here. My interpretation is that you would like us to >> > change the use of the PageReserved flag such that during boot, we do not >> > set the flag at all from memmap_init_zone, and then only set it on pages >> > which, at the time of free_all_bootmem, have been allocated/reserved in >> > the memblock allocator. Is that correct? I will start to work that up >> > on the assumption that is what you are asking for. >> >> Not exactly. >> >> your change should be right, but there is some subtle change about >> holes handling. >> >> before mem holes between memory ranges in memblock.memory, get struct page, >> and initialized with to Reserved in memmap_init_zone. >> Those holes is not in memory.reserved, with your patches, those hole's >> struct page >> will still have all 0. >> >> Please separate change about set page to reserved according to memory.reserved >> to another patch. > > > Just want to make sure this is where you want me to go. Here is my > currently untested patch. Is that what you were expecting to have done? > One thing I don't like about this patch is it seems to slow down boot in > my simulator environment. I think I am going to look at restructuring > things a bit to see if I can eliminate that performance penalty. > Otherwise, I think I am following your directions. 
> > From bdd2fefa59af18f283af6f066bc644ddfa5c7da8 Mon Sep 17 00:00:00 2001 > From: Robin Holt > Date: Thu, 25 Jul 2013 04:25:15 -0500 > Subject: [RFC -v2-pre2 4/5] ZZZ Only SegPageReserved() on memblock reserved > pages. yes. > > --- > include/linux/mm.h | 2 ++ > mm/nobootmem.c | 3 +++ > mm/page_alloc.c | 18 +++++++++++------- > 3 files changed, 16 insertions(+), 7 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e0c8528..b264a26 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) > totalram_pages += count; > } > > +extern void reserve_bootmem_region(unsigned long start, unsigned long end); > + > /* Free the reserved page into the buddy system, so it gets managed. */ > static inline void __free_reserved_page(struct page *page) > { > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index 2159e68..0840af2 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) > phys_addr_t start, end, size; > u64 i; > > + for_each_reserved_mem_region(i, &start, &end) > + reserve_bootmem_region(start, end); > + > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > count += __free_memory_core(start, end); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 048e166..3aa30b7 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -698,7 +698,7 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > } > > static void __init_single_page(unsigned long pfn, unsigned long zone, > - int nid, int reserved) > + int nid, int page_count) > { > struct page *page = pfn_to_page(pfn); > struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > @@ -708,12 +708,9 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, > init_page_count(page); > page_mapcount_reset(page); > page_nid_reset_last(page); > - if (reserved) { > - 
SetPageReserved(page); > - } else { > - ClearPageReserved(page); > - set_page_count(page, 0); > - } > + ClearPageReserved(page); > + set_page_count(page, page_count); > + > /* > * Mark the block movable so that blocks are reserved for > * movable at startup. This will force kernel allocations > @@ -741,6 +738,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, > #endif > } > > +void reserve_bootmem_region(unsigned long start, unsigned long end) > +{ > + for (; start < end; start++) > + if (pfn_valid(start)) > + SetPageReserved(pfn_to_page(start)); > +} > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > -- > 1.8.2.1 > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754994Ab3HBRoo (ORCPT ); Fri, 2 Aug 2013 13:44:44 -0400 Received: from relay1.sgi.com ([192.48.179.29]:33493 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753743Ab3HBRol (ORCPT ); Fri, 2 Aug 2013 13:44:41 -0400 From: Nathan Zimmer To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: [RFC v2 1/5] memblock: Introduce a for_each_reserved_mem_region iterator. 
Date: Fri, 2 Aug 2013 12:44:23 -0500
Message-Id: <1375465467-40488-2-git-send-email-nzimmer@sgi.com>
X-Mailer: git-send-email 1.8.2.1
In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com>
References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Robin Holt

As part of initializing struct page's in 2MiB chunks, we noticed that
at the end of free_all_bootmem(), there was nothing which had forced
the reserved/allocated 4KiB pages to be initialized.

This helper function will be used for that expansion.

Signed-off-by: Robin Holt
Signed-off-by: Nathan Zimmer
To: "H. Peter Anvin"
To: Ingo Molnar
Cc: Linux Kernel
Cc: Linux MM
Cc: Rob Landley
Cc: Mike Travis
Cc: Daniel J Blueman
Cc: Andrew Morton
Cc: Greg KH
Cc: Yinghai Lu
Cc: Mel Gorman
---
 include/linux/memblock.h | 18 ++++++++++++++++++
 mm/memblock.c            | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..e99bbd1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
 	     i != (u64)ULLONG_MAX;					\
 	     __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))

+void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
+				phys_addr_t *out_end);
+
+/**
+ * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * @i: u64 used as loop variable
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over reserved areas of memblock. Available as soon as memblock
+ * is initialized.
+ */
+#define for_each_reserved_mem_region(i, p_start, p_end)			\
+	for (i = 0UL,							\
+	     __next_reserved_mem_region(&i, p_start, p_end);		\
+	     i != (u64)ULLONG_MAX;					\
+	     __next_reserved_mem_region(&i, p_start, p_end))
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);

diff --git a/mm/memblock.c b/mm/memblock.c
index c5fad93..0d7d6e7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 }

 /**
+ * __next_reserved_mem_region - next function for for_each_reserved_mem_region()
+ * @idx: pointer to u64 loop variable
+ * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL
+ *
+ * Iterate over all reserved memory regions.
+ */
+void __init_memblock __next_reserved_mem_region(u64 *idx,
+					   phys_addr_t *out_start,
+					   phys_addr_t *out_end)
+{
+	struct memblock_type *rsv = &memblock.reserved;
+
+	if (*idx >= 0 && *idx < rsv->cnt) {
+		struct memblock_region *r = &rsv->regions[*idx];
+		phys_addr_t base = r->base;
+		phys_addr_t size = r->size;
+
+		if (out_start)
+			*out_start = base;
+		if (out_end)
+			*out_end = base + size - 1;
+
+		*idx += 1;
+		return;
+	}
+
+	/* signal end of iteration */
+	*idx = ULLONG_MAX;
+}
+
+/**
  * __next_free_mem_range - next function for for_each_free_mem_range()
  * @idx: pointer to u64 loop variable
  * @nid: nid: node selector, %MAX_NUMNODES for all nodes
--
1.8.2.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755410Ab3HBRop (ORCPT ); Fri, 2 Aug 2013 13:44:45 -0400
Received: from relay1.sgi.com ([192.48.179.29]:33494 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753893Ab3HBRol (ORCPT ); Fri, 2 Aug 2013 13:44:41 -0400
From: Nathan Zimmer
To: hpa@zytor.com, mingo@kernel.org
Cc: linux-kernel@vger.kernel.org,
linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: [RFC v2 2/5] Have __free_pages_memory() free in larger chunks. Date: Fri, 2 Aug 2013 12:44:24 -0500 Message-Id: <1375465467-40488-3-git-send-email-nzimmer@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Robin Holt Currently, when free_all_bootmem() calls __free_pages_memory(), the number of contiguous pages that __free_pages_memory() passes to the buddy allocator is limited to BITS_PER_LONG. In order to be able to free only the first page of a 2MiB chunk, we need that to be increased. We are increasing to the maximum size available. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- mm/nobootmem.c | 25 ++++++++----------------- 1 file changed, 8 insertions(+), 17 deletions(-) diff --git a/mm/nobootmem.c b/mm/nobootmem.c index bdd3fa2..2159e68 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -82,27 +82,18 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) static void __init __free_pages_memory(unsigned long start, unsigned long end) { - unsigned long i, start_aligned, end_aligned; - int order = ilog2(BITS_PER_LONG); + int order; - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1); - end_aligned = end & ~(BITS_PER_LONG - 1); + while (start < end) { + order = min(MAX_ORDER - 1, __ffs(start)); - if (end_aligned <= start_aligned) { - for (i = start; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + while (start + (1UL << order) > end) + order--; - return; - } - - for (i = start; i < start_aligned; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + __free_pages_bootmem(pfn_to_page(start), order); - for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG) - __free_pages_bootmem(pfn_to_page(i), order); - - for (i = end_aligned; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + start += (1UL << order); + } } static unsigned long __init __free_memory_core(phys_addr_t start, -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754572Ab3HBRom (ORCPT ); Fri, 2 Aug 2013 13:44:42 -0400 Received: from relay1.sgi.com ([192.48.179.29]:33497 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752979Ab3HBRol (ORCPT ); Fri, 2 Aug 2013 13:44:41 -0400 From: Nathan Zimmer To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, 
travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: [RFC v2 4/5] Only set page reserved in the memblock region Date: Fri, 2 Aug 2013 12:44:26 -0500 Message-Id: <1375465467-40488-5-git-send-email-nzimmer@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Currently, each page struct is set as reserved upon initialization. This changes to starting with the reserved bit clear and then only setting the bit in the reserved region. I could restructure a bit to eliminate the performance hit, but I wanted to make sure I am on track first. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/mm.h | 2 ++ mm/nobootmem.c | 3 +++ mm/page_alloc.c | 16 ++++++++++++---- 3 files changed, 17 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e0c8528..b264a26 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) totalram_pages += count; } +extern void reserve_bootmem_region(unsigned long start, unsigned long end); + /* Free the reserved page into the buddy system, so it gets managed.
*/ static inline void __free_reserved_page(struct page *page) { diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 2159e68..0840af2 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) phys_addr_t start, end, size; u64 i; + for_each_reserved_mem_region(i, &start, &end) + reserve_bootmem_region(start, end); + for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) count += __free_memory_core(start, end); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index df3ec13..382223e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } -static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) +static void __init_single_page(unsigned long pfn, unsigned long zone, + int nid, int page_count) { struct page *page = pfn_to_page(pfn); struct zone *z = &NODE_DATA(nid)->node_zones[zone]; set_page_links(page, zone, nid, pfn); mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); page_mapcount_reset(page); page_nid_reset_last(page); - SetPageReserved(page); + set_page_count(page, page_count); + ClearPageReserved(page); /* * Mark the block movable so that blocks are reserved for @@ -736,6 +737,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) #endif } +void reserve_bootmem_region(unsigned long start, unsigned long end) +{ + for (; start < end; start++) + if (pfn_valid(start)) + SetPageReserved(pfn_to_page(start)); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -4010,7 +4018,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - __init_single_page(pfn, zone, nid); + __init_single_page(pfn, zone, nid, 1); } } -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: 
(majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755166Ab3HBRqF (ORCPT ); Fri, 2 Aug 2013 13:46:05 -0400 Received: from relay1.sgi.com ([192.48.179.29]:33506 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753950Ab3HBRol (ORCPT ); Fri, 2 Aug 2013 13:44:41 -0400 From: Nathan Zimmer To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: [RFC v2 3/5] Move page initialization into a separate function. Date: Fri, 2 Aug 2013 12:44:25 -0500 Message-Id: <1375465467-40488-4-git-send-email-nzimmer@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Robin Holt Currently, memmap_init_zone() has all the smarts for initializing a single page. When we convert to initializing pages in a 2MiB chunk, we will need to do this equivalent work from two separate places so we are breaking out a helper function. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- mm/mm_init.c | 2 +- mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++-------------------------- 2 files changed, 41 insertions(+), 34 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index c280a02..be8a539 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, +void mminit_verify_page_links(struct page *page, enum zone_type zone, unsigned long nid, unsigned long pfn) { BUG_ON(page_to_nid(page) != nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5adf81e..df3ec13 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,6 +697,45 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } +static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) +{ + struct page *page = pfn_to_page(pfn); + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_nid_reset_last(page); + SetPageReserved(page); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. 
+ */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. */ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -3951,7 +3990,6 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { - struct page *page; unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; @@ -3972,38 +4010,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_nid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_page(pfn, zone, nid); } } -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754881Ab3HBRqD (ORCPT ); Fri, 2 Aug 2013 13:46:03 -0400 Received: from relay3.sgi.com ([192.48.152.1]:53200 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1753988Ab3HBRol (ORCPT ); Fri, 2 Aug 2013 13:44:41 -0400 From: Nathan Zimmer To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: [RFC v2 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Date: Fri, 2 Aug 2013 12:44:22 -0500 Message-Id: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1373594635-131067-1-git-send-email-holt@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org We are still restricting ourselves to 2MiB initialization to keep the patch set a little smaller and more clear. We are still struggling with the expand(). We nearly always see a first reference to a struct page which is in the middle of the 2MiB region. We were unable to find a good solution. Also, given the strong warning at the head of expand(), we did not feel experienced enough to refactor it to make things always reference the 2MiB page first. The only other fastpath impact left is the expansion in prep_new_page. With this patch, we did boot a 16TiB machine. The two main areas that benefit from this patch are free_all_bootmem and memmap_init_zone.
Without the patches it took 407 seconds and 1151 seconds respectively. With the patches it took 220 and 49 seconds respectively. This is a total savings of 1289 seconds (21 minutes). These times were acquired using a modified version of a script which records the time in microseconds at the beginning of each line of output. The previous patch set was faster through free_all_bootmem but I wanted to include Yinghai's suggestion. Hopefully I didn't miss the mark too much with that patch and yes I do still need to optimize it. I know there are still some rough parts but I wanted to check in with the full patch set. Nathan Zimmer (1): Only set page reserved in the memblock region Robin Holt (4): memblock: Introduce a for_each_reserved_mem_region iterator. Have __free_pages_memory() free in larger chunks. Move page initialization into a separate function. Sparse initialization of struct page array. include/linux/memblock.h | 18 +++++ include/linux/mm.h | 2 + include/linux/page-flags.h | 5 +- mm/memblock.c | 32 ++++++++ mm/mm_init.c | 2 +- mm/nobootmem.c | 28 +++---- mm/page_alloc.c | 194 ++++++++++++++++++++++++++++++++++++--------- 7 files changed, 225 insertions(+), 56 deletions(-) -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754760Ab3HBRqB (ORCPT ); Fri, 2 Aug 2013 13:46:01 -0400 Received: from relay3.sgi.com ([192.48.152.1]:53211 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1754003Ab3HBRol (ORCPT ); Fri, 2 Aug 2013 13:44:41 -0400 From: Nathan Zimmer To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: [RFC v2 5/5] Sparse initialization of struct page array.
Date: Fri, 2 Aug 2013 12:44:27 -0500 Message-Id: <1375465467-40488-6-git-send-email-nzimmer@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Robin Holt During boot of large memory machines, a significant portion of boot is spent initializing the struct page array. The vast majority of those pages are not referenced during boot. Change this over to only initializing the pages when they are actually allocated. Besides the advantage of boot speed, this allows us the chance to use normal performance monitoring tools to determine where the bulk of time is spent during page initialization. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/page-flags.h | 5 +- mm/page_alloc.c | 120 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 119 insertions(+), 6 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6d53675..d592065 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -83,6 +83,7 @@ enum pageflags { PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ PG_arch_1, PG_reserved, + PG_uninitialized_2m, PG_private, /* If pagecache, has fs-private data */ PG_private_2, /* If pagecache, has fs aux data */ PG_writeback, /* Page is under writeback */ @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) __PAGEFLAG(SlobFree, slob_free) +PAGEFLAG(Uninitialized2m, uninitialized_2m) + /* * Private page markings that may be used by the filesystem that owns the page * for its own purposes. 
@@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) #define PAGE_FLAGS_CHECK_AT_FREE \ (1 << PG_lru | 1 << PG_locked | \ 1 << PG_private | 1 << PG_private_2 | \ - 1 << PG_writeback | 1 << PG_reserved | \ + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized_2m | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ __PG_COMPOUND_LOCK) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 382223e..c2fd03a0c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -737,8 +737,53 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, #endif } +static void expand_page_initialization(struct page *basepage) +{ + unsigned long pfn = page_to_pfn(basepage); + unsigned long end_pfn = pfn + PTRS_PER_PMD; + unsigned long zone = page_zonenum(basepage); + int count = page_count(basepage); + int nid = page_to_nid(basepage); + + ClearPageUninitialized2m(basepage); + + for (pfn++; pfn < end_pfn; pfn++) + __init_single_page(pfn, zone, nid, count); +} + +static void ensure_pages_are_initialized(unsigned long start_pfn, + unsigned long end_pfn) +{ + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); + unsigned long aligned_end_pfn; + struct page *page; + + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); + aligned_end_pfn += PTRS_PER_PMD; + while (aligned_start_pfn < aligned_end_pfn) { + if (pfn_valid(aligned_start_pfn)) { + page = pfn_to_page(aligned_start_pfn); + + if (PageUninitialized2m(page)) + expand_page_initialization(page); + } + + aligned_start_pfn += PTRS_PER_PMD; + } +} + +static inline void ensure_page_is_initialized(struct page *page) +{ + ensure_pages_are_initialized(page_to_pfn(page), page_to_pfn(page)); +} + void reserve_bootmem_region(unsigned long start, unsigned long end) { + unsigned long start_pfn = PFN_DOWN(start); + unsigned long end_pfn = PFN_UP(end); + + ensure_pages_are_initialized(start_pfn, end_pfn); + for (; start < end; start++) 
if (pfn_valid(start)) SetPageReserved(pfn_to_page(start)); @@ -755,7 +800,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) if (PageAnon(page)) page->mapping = NULL; for (i = 0; i < (1 << order); i++) - bad += free_pages_check(page + i); + if (PageUninitialized2m(page + i)) + i += PTRS_PER_PMD - 1; + else + bad += free_pages_check(page + i); if (bad) return false; @@ -799,13 +847,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) unsigned int loop; prefetchw(page); - for (loop = 0; loop < nr_pages; loop++) { + for (loop = 0; loop < nr_pages; ) { struct page *p = &page[loop]; if (loop + 1 < nr_pages) prefetchw(p + 1); + + if ((PageUninitialized2m(p)) && + ((loop + PTRS_PER_PMD) > nr_pages)) + ensure_page_is_initialized(p); + __ClearPageReserved(p); set_page_count(p, 0); + if (PageUninitialized2m(p)) + loop += PTRS_PER_PMD; + else + loop += 1; } page_zone(page)->managed_pages += 1 << order; @@ -860,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page, area--; high--; size >>= 1; + ensure_page_is_initialized(page); VM_BUG_ON(bad_range(zone, &page[size])); #ifdef CONFIG_DEBUG_PAGEALLOC @@ -907,6 +965,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) for (i = 0; i < (1 << order); i++) { struct page *p = page + i; + + if (PageUninitialized2m(p)) + expand_page_initialization(page); + if (unlikely(check_new_page(p))) return 1; } @@ -989,6 +1051,8 @@ int move_freepages(struct zone *zone, unsigned long order; int pages_moved = 0; + ensure_pages_are_initialized(page_to_pfn(start_page), + page_to_pfn(end_page)); #ifndef CONFIG_HOLES_IN_ZONE /* * page_zone is not safe to call in this context when @@ -3902,6 +3966,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn) for (pfn = start_pfn; pfn < end_pfn; pfn++) { if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn))) return 1; + + if (PageUninitialized2m(pfn_to_page(pfn))) + pfn += 
PTRS_PER_PMD; } return 0; } @@ -3991,6 +4058,34 @@ static void setup_zone_migrate_reserve(struct zone *zone) } /* + * This function tells us how many pfns we have available. + * Available meaning valid and on the specified node. + * It returns size if that many pfns are available, 1 otherwise. + */ +static int __meminit pfn_range_init_avail(unsigned long pfn, + unsigned long end_pfn, + unsigned long size, int nid) +{ + unsigned long validate_end_pfn = pfn + size; + + if (pfn & (size - 1)) + return 1; + + if (pfn + size >= end_pfn) + return 1; + + while (pfn < validate_end_pfn) { + if (!early_pfn_valid(pfn)) + return 1; + if (!early_pfn_in_nid(pfn, nid)) + return 1; + pfn++; + } + + return size; +} + +/* * Initially all pages are reserved - free ones are freed * up by free_all_bootmem() once the early boot process is * done. Non-atomic initialization, single-pass. @@ -4006,19 +4101,33 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, highest_memmap_pfn = end_pfn - 1; z = &NODE_DATA(nid)->node_zones[zone]; - for (pfn = start_pfn; pfn < end_pfn; pfn++) { + for (pfn = start_pfn; pfn < end_pfn; ) { /* * There can be holes in boot-time mem_map[]s * handed to this function. They do not * exist on hotplugged memory.
*/ + int pfns = 1; if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn++; continue; - if (!early_pfn_in_nid(pfn, nid)) + } + if (!early_pfn_in_nid(pfn, nid)) { + pfn++; continue; + } + + pfns = pfn_range_init_avail(pfn, end_pfn, + PTRS_PER_PMD, nid); } + __init_single_page(pfn, zone, nid, 1); + + if (pfns > 1) + SetPageUninitialized2m(pfn_to_page(pfn)); + + pfn += pfns; } } @@ -6237,6 +6346,7 @@ static const struct trace_print_flags pageflag_names[] = { {1UL << PG_owner_priv_1, "owner_priv_1" }, {1UL << PG_arch_1, "arch_1" }, {1UL << PG_reserved, "reserved" }, + {1UL << PG_uninitialized_2m, "uninitialized_2m" }, {1UL << PG_private, "private" }, {1UL << PG_private_2, "private_2" }, {1UL << PG_writeback, "writeback" }, -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753003Ab3HCUE6 (ORCPT ); Sat, 3 Aug 2013 16:04:58 -0400 Received: from relay3.sgi.com ([192.48.152.1]:47433 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1752852Ab3HCUE5 (ORCPT ); Sat, 3 Aug 2013 16:04:57 -0400 Date: Sat, 3 Aug 2013 15:04:54 -0500 From: Nathan Zimmer To: Nathan Zimmer Cc: hpa@zytor.com, mingo@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: Re: [RFC v2 4/5] Only set page reserved in the memblock region Message-ID: <20130803200453.GA185972@asylum.americas.sgi.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1375465467-40488-5-git-send-email-nzimmer@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1375465467-40488-5-git-send-email-nzimmer@sgi.com> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Aug 02, 2013 at 12:44:26PM -0500, Nathan Zimmer wrote: > Currently, each page struct is set as reserved upon > initialization. This changes to starting with the reserved bit clear and > then only setting the bit in the reserved region. > > I could restructure a bit to eliminate the performance hit, but I wanted to make > sure I am on track first. > > Signed-off-by: Robin Holt > Signed-off-by: Nathan Zimmer > To: "H. Peter Anvin" > To: Ingo Molnar > Cc: Linux Kernel > Cc: Linux MM > Cc: Rob Landley > Cc: Mike Travis > Cc: Daniel J Blueman > Cc: Andrew Morton > Cc: Greg KH > Cc: Yinghai Lu > Cc: Mel Gorman > --- > include/linux/mm.h | 2 ++ > mm/nobootmem.c | 3 +++ > mm/page_alloc.c | 16 ++++++++++++---- > 3 files changed, 17 insertions(+), 4 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e0c8528..b264a26 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) > totalram_pages += count; > } > > +extern void reserve_bootmem_region(unsigned long start, unsigned long end); > + > /* Free the reserved page into the buddy system, so it gets managed.
*/ > static inline void __free_reserved_page(struct page *page) > { > diff --git a/mm/nobootmem.c b/mm/nobootmem.c > index 2159e68..0840af2 100644 > --- a/mm/nobootmem.c > +++ b/mm/nobootmem.c > @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) > phys_addr_t start, end, size; > u64 i; > > + for_each_reserved_mem_region(i, &start, &end) > + reserve_bootmem_region(start, end); > + > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) > count += __free_memory_core(start, end); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index df3ec13..382223e 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order, > spin_unlock(&zone->lock); > } > > -static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) > +static void __init_single_page(unsigned long pfn, unsigned long zone, > + int nid, int page_count) > { > struct page *page = pfn_to_page(pfn); > struct zone *z = &NODE_DATA(nid)->node_zones[zone]; > > set_page_links(page, zone, nid, pfn); > mminit_verify_page_links(page, zone, nid, pfn); > - init_page_count(page); > page_mapcount_reset(page); > page_nid_reset_last(page); > - SetPageReserved(page); > + set_page_count(page, page_count); > + ClearPageReserved(page); > > /* > * Mark the block movable so that blocks are reserved for > @@ -736,6 +737,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) > #endif > } > > +void reserve_bootmem_region(unsigned long start, unsigned long end) > +{ > + for (; start < end; start++) > + if (pfn_valid(start)) > + SetPageReserved(pfn_to_page(start)); > +} > + > static bool free_pages_prepare(struct page *page, unsigned int order) > { > int i; > @@ -4010,7 +4018,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, > if (!early_pfn_in_nid(pfn, nid)) > continue; > } > - __init_single_page(pfn, zone, nid); > + 
__init_single_page(pfn, zone, nid, 1); > } > } > > -- > 1.8.2.1 > Actually, I believe reserve_bootmem_region is wrong. I am passing in phys_addr_t and not pfns. It should be: void reserve_bootmem_region(unsigned long start, unsigned long end) { unsigned long start_pfn = PFN_DOWN(start); unsigned long end_pfn = PFN_UP(end); for (; start_pfn < end_pfn; start_pfn++) if (pfn_valid(start_pfn)) SetPageReserved(pfn_to_page(start_pfn)); } That also brings the timings back in line with the previous patch set. Nate From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753284Ab3HEJ6R (ORCPT ); Mon, 5 Aug 2013 05:58:17 -0400 Received: from mail-bk0-f43.google.com ([209.85.214.43]:35808 "EHLO mail-bk0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751445Ab3HEJ6Q (ORCPT ); Mon, 5 Aug 2013 05:58:16 -0400 Date: Mon, 5 Aug 2013 11:58:12 +0200 From: Ingo Molnar To: Nathan Zimmer Cc: hpa@zytor.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: Re: [RFC v2 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130805095812.GA29404@gmail.com> References: <1373594635-131067-1-git-send-email-holt@sgi.com> <1375465467-40488-1-git-send-email-nzimmer@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Nathan Zimmer wrote: > We are still restricting ourselves to 2MiB initialization to > keep the patch set a little smaller and more clear. > > We are still struggling with the expand().
We nearly always see a first > reference to a struct page which is in the middle of the 2MiB region. > We were unable to find a good solution. Also, given the strong warning > at the head of expand(), we did not feel experienced enough to refactor > it to make things always reference the 2MiB page first. The only other > fastpath impact left is the expansion in prep_new_page. I suppose it's about this chunk: @@ -860,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page, area--; high--; size >>= 1; + ensure_page_is_initialized(page); VM_BUG_ON(bad_range(zone, &page[size])); where ensure_page_is_initialized() does, in essence: + while (aligned_start_pfn < aligned_end_pfn) { + if (pfn_valid(aligned_start_pfn)) { + page = pfn_to_page(aligned_start_pfn); + + if (PageUninitialized2m(page)) + expand_page_initialization(page); + } + + aligned_start_pfn += PTRS_PER_PMD; + } where aligned_start_pfn is 2MB rounded down. This looks like an expensive loop to execute for a single page: there are 512 pages in a 2MB range, so on average this iterates 256 times, for every single page of allocation. Right? I might be missing something, but why not just represent the initialization state in 2MB chunks: it is either fully uninitialized, or fully initialized. If any page in the 'middle' gets allocated, all page heads have to get initialized. That should make the fast path test fairly cheap, basically just PageUninitialized2m(page) has to be tested - and that will fail in the post-initialization fastpath.
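A minimal userspace sketch of the chunk-granular state described above; it is not code from the posted series. It assumes 4KiB pages (512 per 2MiB chunk) and keeps the "rest of chunk pending" flag on the chunk's head page. The names fake_page, boot_init(), expand_chunk() and touch_page() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4KiB pages, so one 2MiB chunk holds 512 of them. */
#define PAGES_PER_CHUNK 512UL
#define NR_CHUNKS       4UL
#define NR_PAGES        (NR_CHUNKS * PAGES_PER_CHUNK)

/* Hypothetical stand-in for struct page; not the kernel structure. */
struct fake_page {
	bool initialized;      /* this entry has been set up */
	bool uninitialized_2m; /* head page only: rest of chunk still pending */
};

static struct fake_page mem_map[NR_PAGES];
static unsigned long slow_path_runs; /* how often a chunk was expanded */

/* Boot: set up only the head page of each chunk and flag the remainder. */
static void boot_init(void)
{
	unsigned long pfn;

	for (pfn = 0; pfn < NR_PAGES; pfn += PAGES_PER_CHUNK) {
		mem_map[pfn].initialized = true;
		mem_map[pfn].uninitialized_2m = true;
	}
}

/* Slow path: initialize every remaining page of the chunk in one go. */
static void expand_chunk(unsigned long head_pfn)
{
	unsigned long pfn;

	for (pfn = head_pfn + 1; pfn < head_pfn + PAGES_PER_CHUNK; pfn++)
		mem_map[pfn].initialized = true;

	mem_map[head_pfn].uninitialized_2m = false;
	slow_path_runs++;
}

/* Fast path: one flag test on the head page, whichever page is touched. */
static struct fake_page *touch_page(unsigned long pfn)
{
	unsigned long head_pfn = pfn & ~(PAGES_PER_CHUNK - 1);

	assert(pfn < NR_PAGES);
	if (mem_map[head_pfn].uninitialized_2m)
		expand_chunk(head_pfn);

	return &mem_map[pfn];
}
```

In this sketch, touching pfn 700 expands the chunk covering pfns 512 to 1023 exactly once; later touches inside that chunk only pay for the single flag test on the head page.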

Thanks,

	Ingo

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755718Ab3HLVys (ORCPT ); Mon, 12 Aug 2013 17:54:48 -0400
Received: from relay3.sgi.com ([192.48.152.1]:59894 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1755093Ab3HLVyo (ORCPT ); Mon, 12 Aug 2013 17:54:44 -0400
From: Nathan Zimmer
To: hpa@zytor.com, mingo@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de
Subject: [RFC v3 4/5] Only set page reserved in the memblock region
Date: Mon, 12 Aug 2013 16:54:39 -0500
Message-Id: <1376344480-156708-5-git-send-email-nzimmer@sgi.com>
X-Mailer: git-send-email 1.8.2.1
In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com>
References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Currently, each page struct is set as reserved when it is initialized.
Change this to start with the reserved bit clear and then set the bit
only in the reserved region.

Signed-off-by: Robin Holt
Signed-off-by: Nathan Zimmer
To: "H.
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/mm.h | 2 ++ mm/nobootmem.c | 3 +++ mm/page_alloc.c | 19 +++++++++++++++---- 3 files changed, 20 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e0c8528..b264a26 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count) totalram_pages += count; } +extern void reserve_bootmem_region(unsigned long start, unsigned long end); + /* Free the reserved page into the buddy system, so it gets managed. */ static inline void __free_reserved_page(struct page *page) { diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 2159e68..0840af2 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void) phys_addr_t start, end, size; u64 i; + for_each_reserved_mem_region(i, &start, &end) + reserve_bootmem_region(start, end); + for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL) count += __free_memory_core(start, end); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index df3ec13..227bd39 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } -static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) +static void __init_single_page(unsigned long pfn, unsigned long zone, + int nid, int page_count) { struct page *page = pfn_to_page(pfn); struct zone *z = &NODE_DATA(nid)->node_zones[zone]; set_page_links(page, zone, nid, pfn); mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); page_mapcount_reset(page); page_nid_reset_last(page); - SetPageReserved(page); + set_page_count(page, page_count); + ClearPageReserved(page); /* * Mark the block 
movable so that blocks are reserved for @@ -736,6 +737,16 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) #endif } +void reserve_bootmem_region(unsigned long start, unsigned long end) +{ + unsigned long start_pfn = PFN_DOWN(start); + unsigned long end_pfn = PFN_UP(end); + + for (; start_pfn < end_pfn; start_pfn++) + if (pfn_valid(start_pfn)) + SetPageReserved(pfn_to_page(start_pfn)); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -4010,7 +4021,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - __init_single_page(pfn, zone, nid); + __init_single_page(pfn, zone, nid, 1); } } -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755638Ab3HLVyq (ORCPT ); Mon, 12 Aug 2013 17:54:46 -0400 Received: from relay3.sgi.com ([192.48.152.1]:59876 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1754873Ab3HLVyo (ORCPT ); Mon, 12 Aug 2013 17:54:44 -0400 From: Nathan Zimmer To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: [RFC v3 1/5] memblock: Introduce a for_each_reserved_mem_region iterator. 
Date: Mon, 12 Aug 2013 16:54:36 -0500
Message-Id: <1376344480-156708-2-git-send-email-nzimmer@sgi.com>
X-Mailer: git-send-email 1.8.2.1
In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com>
References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Robin Holt

As part of initializing struct pages in 2MiB chunks, we noticed that
at the end of free_all_bootmem(), there was nothing which had forced
the reserved/allocated 4KiB pages to be initialized.

This helper function will be used for that expansion.

Signed-off-by: Robin Holt
Signed-off-by: Nathan Zimmer
To: "H. Peter Anvin"
To: Ingo Molnar
Cc: Linux Kernel
Cc: Linux MM
Cc: Rob Landley
Cc: Mike Travis
Cc: Daniel J Blueman
Cc: Andrew Morton
Cc: Greg KH
Cc: Yinghai Lu
Cc: Mel Gorman
---
 include/linux/memblock.h | 18 ++++++++++++++++++
 mm/memblock.c            | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..e99bbd1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
 	     i != (u64)ULLONG_MAX;					\
 	     __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
 
+void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
+				phys_addr_t *out_end);
+
+/**
+ * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * @i: u64 used as loop variable
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over reserved areas of memblock. Available as soon as memblock
+ * is initialized.
+ */
+#define for_each_reserved_mem_region(i, p_start, p_end)			\
+	for (i = 0UL,							\
+	     __next_reserved_mem_region(&i, p_start, p_end);		\
+	     i != (u64)ULLONG_MAX;					\
+	     __next_reserved_mem_region(&i, p_start, p_end))
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);

diff --git a/mm/memblock.c b/mm/memblock.c
index c5fad93..0d7d6e7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
 }
 
 /**
+ * __next_reserved_mem_region - next function for for_each_reserved_mem_region()
+ * @idx: pointer to u64 loop variable
+ * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL
+ *
+ * Iterate over all reserved memory regions.
+ */
+void __init_memblock __next_reserved_mem_region(u64 *idx,
+					   phys_addr_t *out_start,
+					   phys_addr_t *out_end)
+{
+	struct memblock_type *rsv = &memblock.reserved;
+
+	if (*idx >= 0 && *idx < rsv->cnt) {
+		struct memblock_region *r = &rsv->regions[*idx];
+		phys_addr_t base = r->base;
+		phys_addr_t size = r->size;
+
+		if (out_start)
+			*out_start = base;
+		if (out_end)
+			*out_end = base + size - 1;
+
+		*idx += 1;
+		return;
+	}
+
+	/* signal end of iteration */
+	*idx = ULLONG_MAX;
+}
+
+/**
  * __next_free_mem_range - next function for for_each_free_mem_range()
  * @idx: pointer to u64 loop variable
  * @nid: nid: node selector, %MAX_NUMNODES for all nodes
--
1.8.2.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755565Ab3HLVyp (ORCPT ); Mon, 12 Aug 2013 17:54:45 -0400
Received: from relay3.sgi.com ([192.48.152.1]:59874 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1754866Ab3HLVyo (ORCPT ); Mon, 12 Aug 2013 17:54:44 -0400
From: Nathan Zimmer
To: hpa@zytor.com, mingo@kernel.org
Cc: linux-kernel@vger.kernel.org,
linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: [RFC v3 3/5] Move page initialization into a separate function. Date: Mon, 12 Aug 2013 16:54:38 -0500 Message-Id: <1376344480-156708-4-git-send-email-nzimmer@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Robin Holt Currently, memmap_init_zone() has all the smarts for initializing a single page. When we convert to initializing pages in a 2MiB chunk, we will need to do this equivalent work from two separate places so we are breaking out a helper function. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- mm/mm_init.c | 2 +- mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++-------------------------- 2 files changed, 41 insertions(+), 34 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index c280a02..be8a539 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, +void mminit_verify_page_links(struct page *page, enum zone_type zone, unsigned long nid, unsigned long pfn) { BUG_ON(page_to_nid(page) != nid); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5adf81e..df3ec13 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -697,6 +697,45 @@ static void free_one_page(struct zone *zone, struct page *page, int order, spin_unlock(&zone->lock); } +static void __init_single_page(unsigned long pfn, unsigned long zone, int nid) +{ + struct page *page = pfn_to_page(pfn); + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_nid_reset_last(page); + SetPageReserved(page); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. 
+ */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. */ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + static bool free_pages_prepare(struct page *page, unsigned int order) { int i; @@ -3951,7 +3990,6 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { - struct page *page; unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; @@ -3972,38 +4010,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_nid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_page(pfn, zone, nid); } } -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755868Ab3HLVz6 (ORCPT ); Mon, 12 Aug 2013 17:55:58 -0400 Received: from relay3.sgi.com ([192.48.152.1]:59885 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1754959Ab3HLVyo (ORCPT ); Mon, 12 Aug 2013 17:54:44 -0400 From: Nathan Zimmer To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: [RFC v3 2/5] Have __free_pages_memory() free in larger chunks. Date: Mon, 12 Aug 2013 16:54:37 -0500 Message-Id: <1376344480-156708-3-git-send-email-nzimmer@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Robin Holt Currently, when free_all_bootmem() calls __free_pages_memory(), the number of contiguous pages that __free_pages_memory() passes to the buddy allocator is limited to BITS_PER_LONG. In order to be able to free only the first page of a 2MiB chunk, we need that to be increased. We are increasing to the maximum size available. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. 
Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- mm/nobootmem.c | 25 ++++++++----------------- 1 file changed, 8 insertions(+), 17 deletions(-) diff --git a/mm/nobootmem.c b/mm/nobootmem.c index bdd3fa2..2159e68 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -82,27 +82,18 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) static void __init __free_pages_memory(unsigned long start, unsigned long end) { - unsigned long i, start_aligned, end_aligned; - int order = ilog2(BITS_PER_LONG); + int order; - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1); - end_aligned = end & ~(BITS_PER_LONG - 1); + while (start < end) { + order = min(MAX_ORDER - 1, __ffs(start)); - if (end_aligned <= start_aligned) { - for (i = start; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + while (start + (1UL << order) > end) + order--; - return; - } - - for (i = start; i < start_aligned; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + __free_pages_bootmem(pfn_to_page(start), order); - for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG) - __free_pages_bootmem(pfn_to_page(i), order); - - for (i = end_aligned; i < end; i++) - __free_pages_bootmem(pfn_to_page(i), 0); + start += (1UL << order); + } } static unsigned long __init __free_memory_core(phys_addr_t start, -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755813Ab3HLVz4 (ORCPT ); Mon, 12 Aug 2013 17:55:56 -0400 Received: from relay2.sgi.com ([192.48.179.30]:50890 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1754815Ab3HLVyo (ORCPT ); Mon, 12 Aug 2013 17:54:44 -0400 From: Nathan Zimmer To: hpa@zytor.com, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, 
travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de
Subject: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator
Date: Mon, 12 Aug 2013 16:54:35 -0500
Message-Id: <1376344480-156708-1-git-send-email-nzimmer@sgi.com>
X-Mailer: git-send-email 1.8.2.1
In-Reply-To: <1375465467-40488-1-git-send-email-nzimmer@sgi.com>
References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

We are still restricting ourselves to 2MiB initialization.
This was initially to keep the patch set a little smaller and more
clear. However, given how well it is currently performing, I don't see
how much better it could be with 2GiB chunks.

As far as extra overhead goes, we incur an extra function call to
ensure_page_is_initialized, but that is only really expensive when we
find uninitialized pages; otherwise it is a flag check once every
PTRS_PER_PMD. To get a better feel for this we ran two quick tests.

The first was simply timing some memhogs. This showed no measurable
difference, so we made a more granular test. We spawned N threads and
started a timer; each thread malloc'ed totalmem/N and then wrote to its
memory to induce page faults, after which we stopped the timer. In this
case each thread had just under 4GB of RAM to fault in. This showed a
measurable difference in the page faulting. The baseline took an average
of 2.68 seconds; the new version took an average of 2.75 seconds, which
is 0.07s slower, or 2.6%.

Are there some other tests I should run?

With this patch, we did boot a 16TiB machine. The two main areas that
benefit from this patch are free_all_bootmem and memmap_init_zone.
Without the patches they took 407 seconds and 1151 seconds respectively.
With the patches they took 13 and 39 seconds respectively. This is a
total savings of 1506 seconds (25 minutes).
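
The figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sketch for verifying the numbers in this cover letter; struct boot_timing and the helper names are hypothetical and are not part of the patch set.

```c
#include <assert.h>

/* Seconds spent in the two boot phases named in the cover letter. */
struct boot_timing {
	unsigned int free_all_bootmem_s;
	unsigned int memmap_init_zone_s;
};

/* Sum of both phases for one kernel. */
static unsigned int boot_total(struct boot_timing t)
{
	return t.free_all_bootmem_s + t.memmap_init_zone_s;
}

/* Seconds saved going from the unpatched to the patched kernel. */
static unsigned int boot_savings(struct boot_timing before,
				 struct boot_timing after)
{
	assert(boot_total(before) >= boot_total(after));
	return boot_total(before) - boot_total(after);
}

/* Relative slowdown of the page-fault test, in percent. */
static double slowdown_percent(double baseline_s, double patched_s)
{
	return (patched_s - baseline_s) / baseline_s * 100.0;
}
```

With the reported values, (407 + 1151) - (13 + 39) = 1558 - 52 = 1506 seconds, a little over 25 minutes, and (2.75 - 2.68) / 2.68 is roughly 2.6%.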
These times were acquired using a modified version of script which
records the time in microseconds at the beginning of each line of output.

Overall I am fairly happy with the patch set at the moment. It improves
boot times without noticeable runtime overhead.
I am, as always, open for suggestions.

v2: Included Yinghai's suggestion to not set the reserved bit until later.

v3: Corrected my first attempt at moving the reserved bit.
__expand_page_initialization should only be called by
ensure_pages_are_initialized.

Nathan Zimmer (1):
  Only set page reserved in the memblock region

Robin Holt (4):
  memblock: Introduce a for_each_reserved_mem_region iterator.
  Have __free_pages_memory() free in larger chunks.
  Move page initialization into a separate function.
  Sparse initialization of struct page array.

 include/linux/memblock.h   |  18 +++++
 include/linux/mm.h         |   2 +
 include/linux/page-flags.h |   5 +-
 mm/memblock.c              |  32 ++++++++
 mm/mm_init.c               |   2 +-
 mm/nobootmem.c             |  28 +++----
 mm/page_alloc.c            | 198 ++++++++++++++++++++++++++++++++++++---------
 7 files changed, 229 insertions(+), 56 deletions(-)

--
1.8.2.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755742Ab3HLVzz (ORCPT ); Mon, 12 Aug 2013 17:55:55 -0400
Received: from relay3.sgi.com ([192.48.152.1]:59911 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1755238Ab3HLVyo (ORCPT ); Mon, 12 Aug 2013 17:54:44 -0400
From: Nathan Zimmer
To: hpa@zytor.com, mingo@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, nzimmer@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de
Subject: [RFC v3 5/5] Sparse initialization of struct page array.
Date: Mon, 12 Aug 2013 16:54:40 -0500 Message-Id: <1376344480-156708-6-git-send-email-nzimmer@sgi.com> X-Mailer: git-send-email 1.8.2.1 In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Robin Holt During boot of large memory machines, a significant portion of boot is spent initializing the struct page array. The vast majority of those pages are not referenced during boot. Change this over to only initializing the pages when they are actually allocated. Besides the advantage of boot speed, this allows us the chance to use normal performance monitoring tools to determine where the bulk of time is spent during page initialization. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer To: "H. Peter Anvin" To: Ingo Molnar Cc: Linux Kernel Cc: Linux MM Cc: Rob Landley Cc: Mike Travis Cc: Daniel J Blueman Cc: Andrew Morton Cc: Greg KH Cc: Yinghai Lu Cc: Mel Gorman --- include/linux/page-flags.h | 5 +- mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 115 insertions(+), 6 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 6d53675..d592065 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -83,6 +83,7 @@ enum pageflags { PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ PG_arch_1, PG_reserved, + PG_uninitialized_2m, PG_private, /* If pagecache, has fs-private data */ PG_private_2, /* If pagecache, has fs aux data */ PG_writeback, /* Page is under writeback */ @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked) __PAGEFLAG(SlobFree, slob_free) +PAGEFLAG(Uninitialized2m, uninitialized_2m) + /* * Private page markings that may be used by the filesystem that owns the page * for its own purposes. 
@@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page) #define PAGE_FLAGS_CHECK_AT_FREE \ (1 << PG_lru | 1 << PG_locked | \ 1 << PG_private | 1 << PG_private_2 | \ - 1 << PG_writeback | 1 << PG_reserved | \ + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized_2m | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \ __PG_COMPOUND_LOCK) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 227bd39..6c35a58 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -737,11 +737,53 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, #endif } +static void __expand_page_initialization(struct page *basepage) +{ + unsigned long pfn = page_to_pfn(basepage); + unsigned long end_pfn = pfn + PTRS_PER_PMD; + unsigned long zone = page_zonenum(basepage); + int count = page_count(basepage); + int nid = page_to_nid(basepage); + + ClearPageUninitialized2m(basepage); + + for (pfn++; pfn < end_pfn; pfn++) + __init_single_page(pfn, zone, nid, count); +} + +static void ensure_pages_are_initialized(unsigned long start_pfn, + unsigned long end_pfn) +{ + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1); + unsigned long aligned_end_pfn; + struct page *page; + + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1); + aligned_end_pfn += PTRS_PER_PMD; + while (aligned_start_pfn < aligned_end_pfn) { + if (pfn_valid(aligned_start_pfn)) { + page = pfn_to_page(aligned_start_pfn); + + if (PageUninitialized2m(page)) + __expand_page_initialization(page); + } + + aligned_start_pfn += PTRS_PER_PMD; + } +} + +static inline void ensure_page_is_initialized(struct page *page) +{ + ensure_pages_are_initialized(page_to_pfn(page), page_to_pfn(page)); +} + void reserve_bootmem_region(unsigned long start, unsigned long end) { unsigned long start_pfn = PFN_DOWN(start); unsigned long end_pfn = PFN_UP(end); + ensure_pages_are_initialized(start_pfn, end_pfn); + for (; start_pfn < end_pfn; 
start_pfn++) if (pfn_valid(start_pfn)) SetPageReserved(pfn_to_page(start_pfn)); @@ -758,7 +800,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order) if (PageAnon(page)) page->mapping = NULL; for (i = 0; i < (1 << order); i++) - bad += free_pages_check(page + i); + if (PageUninitialized2m(page + i)) + i += PTRS_PER_PMD - 1; + else + bad += free_pages_check(page + i); if (bad) return false; @@ -802,13 +847,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) unsigned int loop; prefetchw(page); - for (loop = 0; loop < nr_pages; loop++) { + for (loop = 0; loop < nr_pages; ) { struct page *p = &page[loop]; if (loop + 1 < nr_pages) prefetchw(p + 1); + + if ((PageUninitialized2m(p)) && + ((loop + PTRS_PER_PMD) > nr_pages)) + ensure_page_is_initialized(p); + __ClearPageReserved(p); set_page_count(p, 0); + if (PageUninitialized2m(p)) + loop += PTRS_PER_PMD; + else + loop += 1; } page_zone(page)->managed_pages += 1 << order; @@ -863,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page, area--; high--; size >>= 1; + ensure_page_is_initialized(&page[size]); VM_BUG_ON(bad_range(zone, &page[size])); #ifdef CONFIG_DEBUG_PAGEALLOC @@ -908,8 +963,11 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) { int i; + ensure_pages_are_initialized(page_to_pfn(page), + page_to_pfn(page+(1<= end_pfn) + return 1; + + while (pfn < validate_end_pfn) { + if (!early_pfn_valid(pfn)) + return 1; + if (!early_pfn_in_nid(pfn, nid)) + return 1; + pfn++; + } + + return size; +} + +/* * Initially all pages are reserved - free ones are freed * up by free_all_bootmem() once the early boot process is * done. Non-atomic initialization, single-pass. 
@@ -4009,19 +4100,33 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, highest_memmap_pfn = end_pfn - 1; z = &NODE_DATA(nid)->node_zones[zone]; - for (pfn = start_pfn; pfn < end_pfn; pfn++) { + for (pfn = start_pfn; pfn < end_pfn; ) { /* * There can be holes in boot-time mem_map[]s * handed to this function. They do not * exist on hotplugged memory. */ + int pfns = 1; if (context == MEMMAP_EARLY) { - if (!early_pfn_valid(pfn)) + if (!early_pfn_valid(pfn)) { + pfn++; continue; - if (!early_pfn_in_nid(pfn, nid)) + } + if (!early_pfn_in_nid(pfn, nid)) { + pfn++; continue; + } + + pfns = pfn_range_init_avail(pfn, end_pfn, + PTRS_PER_PMD, nid); } + __init_single_page(pfn, zone, nid, 1); + + if (pfns > 1) + SetPageUninitialized2m(pfn_to_page(pfn)); + + pfn += pfns; } } @@ -6240,6 +6345,7 @@ static const struct trace_print_flags pageflag_names[] = { {1UL << PG_owner_priv_1, "owner_priv_1" }, {1UL << PG_arch_1, "arch_1" }, {1UL << PG_reserved, "reserved" }, + {1UL << PG_uninitialized_2m, "uninitialized_2m" }, {1UL << PG_private, "private" }, {1UL << PG_private_2, "private_2" }, {1UL << PG_writeback, "writeback" }, -- 1.8.2.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757437Ab3HMK6w (ORCPT ); Tue, 13 Aug 2013 06:58:52 -0400 Received: from mail-ee0-f46.google.com ([74.125.83.46]:38573 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756214Ab3HMK6v (ORCPT ); Tue, 13 Aug 2013 06:58:51 -0400 Date: Tue, 13 Aug 2013 12:58:47 +0200 From: Ingo Molnar To: Nathan Zimmer Cc: hpa@zytor.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de, Linus Torvalds Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator 
Message-ID: <20130813105847.GC2170@gmail.com>
References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

* Nathan Zimmer wrote:

> We are still restricting ourselves to 2MiB initialization.
> This was initially to keep the patch set a little smaller and more
> clear. However, given how well it is currently performing, I don't see
> how much better it could be with 2GiB chunks.
>
> As far as extra overhead goes, we incur an extra function call to
> ensure_page_is_initialized, but that is only really expensive when we
> find uninitialized pages; otherwise it is a flag check once every
> PTRS_PER_PMD. [...]

Mind expanding on this in more detail?

The main fastpath overhead we are really interested in is the 'memory is
already fully initialized and we reallocate a second time' case - i.e. the
*second* (and subsequent), post-initialization allocation of any page
range.

Those allocations are the ones that matter most: they will occur again and
again, for the lifetime of the booted up system.

What extra overhead is there in that case? Only a flag check that is
merged into an existing flag check (in free_pages_check()) and thus is
essentially zero overhead? Or is it more involved - if yes, why?

One would naively think that nothing but the flags check is needed in this
case: if all 512 pages in an aligned 2MB block are fully initialized, and
marked as initialized in all the 512 page heads, then no other runtime
check will be needed in the future.
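
The "flag check that is merged into an existing flag check" can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the TOY_* names and bit positions are hypothetical, chosen to mirror how patch 5/5 adds PG_uninitialized_2m to PAGE_FLAGS_CHECK_AT_FREE. It does not model the other call sites the series touches (expand() and prep_new_page()).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the positions are not the kernel's. */
enum toy_pageflags {
	TOY_PG_locked,
	TOY_PG_reserved,
	TOY_PG_uninitialized_2m,
	TOY_PG_writeback,
	TOY_NR_PAGEFLAGS
};

/* Mask tested at free time before the new flag is added. */
#define TOY_FLAGS_CHECK_OLD \
	((1UL << TOY_PG_locked) | (1UL << TOY_PG_reserved) | \
	 (1UL << TOY_PG_writeback))

/* Same mask with the "uninitialized" bit OR-ed in. */
#define TOY_FLAGS_CHECK_NEW \
	(TOY_FLAGS_CHECK_OLD | (1UL << TOY_PG_uninitialized_2m))

/* Helper for building test values; rejects out-of-range bits. */
static void toy_set_flag(unsigned long *flags, enum toy_pageflags bit)
{
	assert(bit < TOY_NR_PAGEFLAGS);
	*flags |= 1UL << bit;
}

/*
 * One AND plus one compare, whether the mask has three bits or four:
 * adding a bit to the mask adds no extra branch to the common case.
 */
static bool toy_flags_need_slow_path(unsigned long flags)
{
	return (flags & TOY_FLAGS_CHECK_NEW) != 0;
}
```

For a fully initialized, unflagged page the combined test is false and nothing else runs; only a page that still carries one of the masked bits takes the slow path.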

Thanks,

	Ingo

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758137Ab3HMRJh (ORCPT ); Tue, 13 Aug 2013 13:09:37 -0400
Received: from mail-vc0-f181.google.com ([209.85.220.181]:36643 "EHLO mail-vc0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755407Ab3HMRJf (ORCPT ); Tue, 13 Aug 2013 13:09:35 -0400
MIME-Version: 1.0
In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com>
References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com>
Date: Tue, 13 Aug 2013 10:09:34 -0700
X-Google-Sender-Auth: GZpiZWhT03QfUxDb8Ym3NBhLilU
Message-ID:
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator
From: Linus Torvalds
To: Nathan Zimmer
Cc: Peter Anvin, Ingo Molnar, Linux Kernel Mailing List, linux-mm, Robin Holt, Rob Landley, Mike Travis, Daniel J Blueman, Andrew Morton, Greg Kroah-Hartman, Yinghai Lu, Mel Gorman
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Aug 12, 2013 at 2:54 PM, Nathan Zimmer wrote:
>
> As far as extra overhead goes, we incur an extra function call to
> ensure_page_is_initialized, but that is only really expensive when we find
> uninitialized pages; otherwise it is a flag check once every PTRS_PER_PMD.
> To get a better feel for this we ran two quick tests.

Sorry for coming into this late and for this last version of the
patch, but I have to say that I'd *much* rather see this delayed
initialization using another data structure than hooking into the
basic page allocation ones..

I understand that you want to do delayed initialization on some TB+
memory machines, but what I don't understand is why it has to be done
when the pages have already been added to the memory management free
list.
Could we not do this much simpler: make the early boot insert the first few gigs of memory (initialized) synchronously into the free lists, and then have a background thread that goes through the rest? That way the MM layer would never see the uninitialized pages. And I bet that *nobody* cares if you "only" have a few gigs of ram during the first few minutes of boot, and you mysteriously end up getting more and more memory for a while until all the RAM has been initialized. IOW, just don't call __free_pages_bootmem() on all the pages all at once. If we have to remove a few __init markers to be able to do some of it later, does anybody really care? I really really dislike this "let's check if memory is initialized at runtime" approach. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758734Ab3HMRYY (ORCPT ); Tue, 13 Aug 2013 13:24:24 -0400 Received: from terminus.zytor.com ([198.137.202.10]:34784 "EHLO mail.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757977Ab3HMRYX (ORCPT ); Tue, 13 Aug 2013 13:24:23 -0400 Message-ID: <520A6BA2.7060800@zytor.com> Date: Tue, 13 Aug 2013 10:23:46 -0700 From: "H.
Peter Anvin" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7 MIME-Version: 1.0 To: Linus Torvalds CC: Nathan Zimmer , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Mike Travis , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> In-Reply-To: X-Enigmail-Version: 1.5.2 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08/13/2013 10:09 AM, Linus Torvalds wrote: > > I really really dislike this "let's check if memory is initialized at > runtime" approach. > It does seem to be getting messy, doesn't it... The one potential serious concern is if that will end up mucking with NUMA affinity in a way that has lasting effects past boot.
-hpa From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758999Ab3HMRd4 (ORCPT ); Tue, 13 Aug 2013 13:33:56 -0400 Received: from relay1.sgi.com ([192.48.179.29]:35914 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755919Ab3HMRdz (ORCPT ); Tue, 13 Aug 2013 13:33:55 -0400 Message-ID: <520A6DFC.1070201@sgi.com> Date: Tue, 13 Aug 2013 10:33:48 -0700 From: Mike Travis User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Linus Torvalds CC: Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 8/13/2013 10:09 AM, Linus Torvalds wrote: > On Mon, Aug 12, 2013 at 2:54 PM, Nathan Zimmer wrote: >> >> As far as extra overhead. We incur an extra function call to >> ensure_page_is_initialized but that is only really expensive when we find >> uninitialized pages, otherwise it is a flag check once every PTRS_PER_PMD. >> To get a better feel for this we ran two quick tests. > > Sorry for coming into this late and for this last version of the > patch, but I have to say that I'd *much* rather see this delayed > initialization using another data structure than hooking into the > basic page allocation ones.. 
> > I understand that you want to do delayed initialization on some TB+ > memory machines, but what I don't understand is why it has to be done > when the pages have already been added to the memory management free > list. > > Could we not do this much simpler: make the early boot insert the > first few gigs of memory (initialized) synchronously into the free > lists, and then have a background thread that goes through the rest? > > That way the MM layer would never see the uninitialized pages. > > And I bet that *nobody* cares if you "only" have a few gigs of ram > during the first few minutes of boot, and you mysteriously end up > getting more and more memory for a while until all the RAM has been > initialized. > > IOW, just don't call __free_pages_bootmem() on all the pages al at > once. If we have to remove a few __init markers to be able to do some > of it later, does anybody really care? > > I really really dislike this "let's check if memory is initialized at > runtime" approach. > > Linus > Initially this patch set consisted of diverting a major portion of the memory to an "absent" list during e820 processing. A very late initcall was then used to dispatch a cpu per node to add that nodes's absent memory. By nature these ran in parallel so Nathan did the work to "parallelize" various global resource locks to become per node locks. This sped up insertion considerably. And by disabling the "auto-start" of the insertion process and using a manual start command, you could monitor the insertion process and find hot spots in the memory initialization code. Also small updates to the sys/devices/{memory,node} drivers to also display the amount of memory still "absent". 
-Mike From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758125Ab3HMRvk (ORCPT ); Tue, 13 Aug 2013 13:51:40 -0400 Received: from mail-vc0-f170.google.com ([209.85.220.170]:46611 "EHLO mail-vc0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756881Ab3HMRvi (ORCPT ); Tue, 13 Aug 2013 13:51:38 -0400 MIME-Version: 1.0 In-Reply-To: <520A6DFC.1070201@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> Date: Tue, 13 Aug 2013 10:51:37 -0700 X-Google-Sender-Auth: hIEpHPy3GhO-h8MzU7GY3qTyZ6U Message-ID: Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator From: Linus Torvalds To: Mike Travis Cc: Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Aug 13, 2013 at 10:33 AM, Mike Travis wrote: > > Initially this patch set consisted of diverting a major portion of the > memory to an "absent" list during e820 processing. A very late initcall > was then used to dispatch a cpu per node to add that nodes's absent > memory. By nature these ran in parallel so Nathan did the work to > "parallelize" various global resource locks to become per node locks. So quite frankly, I'm not sure how worthwhile it even is to parallelize the thing. I realize that some environments may care about getting up to full memory population very quickly, but I think it would be very rare and specialized, and shouldn't necessarily be part of the initial patches. And it really doesn't have to be an initcall at all - at least not a synchronous one.
A late initcall to get the process *started*, but the process itself
could easily be done with a separate thread asynchronously, and let the
machine boot up while that thread is going.

And in fact, I'd argue that instead of trying to make it fast and
parallelize things excessively, you might want to make the memory
initialization *slow*, and make all the rest of the bootup have higher
priority.

At that point, who cares if it takes 400 seconds to get all memory
initialized? In fact, who cares if it takes twice that? Let's assume
that the rest of the boot takes 30s (which is pretty aggressive for some
big server with terabytes of memory), even if the memory initialization
was running in the background and only during idle time for probing, I'm
sure you'd have a few hundred gigs of RAM initialized by the time you
can log in. And if it then takes another ten minutes until you have the
full 16TB initialized, and some things might be a tad slower early on,
does anybody really care? The machine will be up and running with plenty
of memory, even if it may not be *all* the memory yet.

I realize that benchmarking cares, and yes, I also realize that some
benchmarks actually want to reboot the machine between some runs just to
get repeatability, but if you're benchmarking a 16TB machine I'm
guessing any serious benchmark that actually uses that much memory is
going to take many hours to a few days to run anyway? Having some way to
wait until the memory is all done (which might even be just a silly
shell script that does "ps" and waits for the kernel threads to all go
away) isn't going to kill the benchmark - and the benchmark itself will
then not have to worry about hitting the "oops, I need to initialize 2GB
of RAM now because I hit an uninitialized page".

Ok, so I don't know all the issues, and in many ways I don't even really
care. You could do it other ways, I don't think this is a big deal.
The part I hate is the runtime hook into the core MM page allocation code, so I'm just throwing out any random thing that comes to my mind that could be used to avoid that part. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759006Ab3HMSEJ (ORCPT ); Tue, 13 Aug 2013 14:04:09 -0400 Received: from relay3.sgi.com ([192.48.152.1]:52647 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1758384Ab3HMSEH (ORCPT ); Tue, 13 Aug 2013 14:04:07 -0400 Message-ID: <520A7514.9020008@sgi.com> Date: Tue, 13 Aug 2013 11:04:04 -0700 From: Mike Travis User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Linus Torvalds CC: Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 8/13/2013 10:51 AM, Linus Torvalds wrote: > by the time you can log in. And if it then takes another ten minutes > until you have the full 16TB initialized, and some things might be a > tad slower early on, does anybody really care? The machine will be up > and running with plenty of memory, even if it may not be *all* the > memory yet. Before the patches adding memory took ~45 mins for 16TB and almost 2 hours for 32TB. Adding it late sped up early boot but late insertion was still very slow, where the full 32TB was still not fully inserted after an hour. 
Doing it in parallel along with the memory hotplug lock per node, we got it down to the 10-15 minute range. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758559Ab3HMTG0 (ORCPT ); Tue, 13 Aug 2013 15:06:26 -0400 Received: from relay1.sgi.com ([192.48.179.29]:50252 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758024Ab3HMTGZ (ORCPT ); Tue, 13 Aug 2013 15:06:25 -0400 Message-ID: <520A83B0.40607@sgi.com> Date: Tue, 13 Aug 2013 12:06:24 -0700 From: Mike Travis User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Linus Torvalds CC: Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <520A7514.9020008@sgi.com> In-Reply-To: <520A7514.9020008@sgi.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 8/13/2013 11:04 AM, Mike Travis wrote: > > > On 8/13/2013 10:51 AM, Linus Torvalds wrote: >> by the time you can log in. And if it then takes another ten minutes >> until you have the full 16TB initialized, and some things might be a >> tad slower early on, does anybody really care? The machine will be up >> and running with plenty of memory, even if it may not be *all* the >> memory yet. > > Before the patches adding memory took ~45 mins for 16TB and almost 2 hours > for 32TB. 
Adding it late sped up early boot but late insertion was still > very slow, where the full 32TB was still not fully inserted after an hour. > Doing it in parallel along with the memory hotplug lock per node, we got > it down to the 10-15 minute range. > FYI, the system at this time had 128 nodes each with 256GB of memory. About 252GB was inserted into the absent list from nodes 1 .. 126. Memory on nodes 0 and 128 was left fully present. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759190Ab3HMUYg (ORCPT ); Tue, 13 Aug 2013 16:24:36 -0400 Received: from mail-ob0-f177.google.com ([209.85.214.177]:41758 "EHLO mail-ob0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758264Ab3HMUYc (ORCPT ); Tue, 13 Aug 2013 16:24:32 -0400 MIME-Version: 1.0 In-Reply-To: <520A83B0.40607@sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <520A7514.9020008@sgi.com> <520A83B0.40607@sgi.com> Date: Tue, 13 Aug 2013 13:24:31 -0700 X-Google-Sender-Auth: Ic39Y3GCJG_7cIZOWK3JoYCiHr4 Message-ID: Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator From: Yinghai Lu To: Mike Travis Cc: Linus Torvalds , Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Mel Gorman Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Aug 13, 2013 at 12:06 PM, Mike Travis wrote: > > > On 8/13/2013 11:04 AM, Mike Travis wrote: >> >> >> On 8/13/2013 10:51 AM, Linus Torvalds wrote: >>> by the time you can log in. 
And if it then takes another ten minutes
>>> until you have the full 16TB initialized, and some things might be a
>>> tad slower early on, does anybody really care? The machine will be up
>>> and running with plenty of memory, even if it may not be *all* the
>>> memory yet.
>>
>> Before the patches adding memory took ~45 mins for 16TB and almost 2 hours
>> for 32TB. Adding it late sped up early boot but late insertion was still
>> very slow, where the full 32TB was still not fully inserted after an hour.
>> Doing it in parallel along with the memory hotplug lock per node, we got
>> it down to the 10-15 minute range.
>>
>
> FYI, the system at this time had 128 nodes each with 256GB of memory.
> About 252GB was inserted into the absent list from nodes 1 .. 126.
> Memory on nodes 0 and 128 was left fully present.

Can we have one topic about those boot time issues at this year's kernel
summit?

There will be more 32-socket x86 systems, and they will have lots of
memory, PCI chains and CPU cores.

In the current kernel/smp.c::smp_init(), we still have

|        /* FIXME: This should be done in userspace --RR */
|        for_each_present_cpu(cpu) {
|                if (num_online_cpus() >= setup_max_cpus)
|                        break;
|                if (!cpu_online(cpu))
|                        cpu_up(cpu);
|        }

Possible solutions would be:
1. delay some memory, PCI chains, or CPU cores,
2. or initialize them in parallel during booting,
3. or add them in parallel after booting.
Thanks Yinghai From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759277Ab3HMUiA (ORCPT ); Tue, 13 Aug 2013 16:38:00 -0400 Received: from relay3.sgi.com ([192.48.152.1]:38604 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1759239Ab3HMUh7 (ORCPT ); Tue, 13 Aug 2013 16:37:59 -0400 Message-ID: <520A9924.7050301@sgi.com> Date: Tue, 13 Aug 2013 13:37:56 -0700 From: Mike Travis User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130509 Thunderbird/17.0.6 MIME-Version: 1.0 To: Yinghai Lu CC: Linus Torvalds , Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Mel Gorman Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <520A7514.9020008@sgi.com> <520A83B0.40607@sgi.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 8/13/2013 1:24 PM, Yinghai Lu wrote: >> > FYI, the system at this time had 128 nodes each with 256GB of memory. >> > About 252GB was inserted into the absent list from nodes 1 .. 126. >> > Memory on nodes 0 and 128 was left fully present. Actually, I was corrected, it was 256 nodes with 128GB (8 * 16GB dimms - which are just now coming out.) So there were 254 concurrent initialization processes running. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759368Ab3HMVfH (ORCPT ); Tue, 13 Aug 2013 17:35:07 -0400 Received: from relay1.sgi.com ([192.48.179.29]:53951 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759350Ab3HMVfF (ORCPT ); Tue, 13 Aug 2013 17:35:05 -0400 Message-ID: <520AA687.3070303@sgi.com> Date: Tue, 13 Aug 2013 16:35:03 -0500 From: Nathan Zimmer User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130804 Thunderbird/17.0.8 MIME-Version: 1.0 To: Mike Travis CC: Linus Torvalds , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <520A7514.9020008@sgi.com> In-Reply-To: <520A7514.9020008@sgi.com> Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [128.162.233.140] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08/13/2013 01:04 PM, Mike Travis wrote: > > On 8/13/2013 10:51 AM, Linus Torvalds wrote: >> by the time you can log in. And if it then takes another ten minutes >> until you have the full 16TB initialized, and some things might be a >> tad slower early on, does anybody really care? The machine will be up >> and running with plenty of memory, even if it may not be *all* the >> memory yet. > Before the patches adding memory took ~45 mins for 16TB and almost 2 hours > for 32TB. Adding it late sped up early boot but late insertion was still > very slow, where the full 32TB was still not fully inserted after an hour. 
> Doing it in parallel along with the memory hotplug lock per node, we got > it down to the 10-15 minute range. Yes but to get it to the 10-15 minute range I had to change a number of system locks. The system_sleep, the memory_hotplug, zonelist_mutex and there was some general alteration to various wmark routines. Some of those fixes I don't know if they would stand up to proper scrutiny but were quick and dirty hacks to allow for progress. Nate From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758431Ab3HMXKY (ORCPT ); Tue, 13 Aug 2013 19:10:24 -0400 Received: from relay2.sgi.com ([192.48.179.30]:47479 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1756090Ab3HMXKW (ORCPT ); Tue, 13 Aug 2013 19:10:22 -0400 Date: Tue, 13 Aug 2013 18:10:20 -0500 From: Nathan Zimmer To: Linus Torvalds Cc: Mike Travis , Nathan Zimmer , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130813231020.GA22667@asylum.americas.sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Aug 13, 2013 at 10:51:37AM -0700, Linus Torvalds wrote: > I realize that benchmarking cares, and yes, I also realize that some > benchmarks actually want to reboot the machine between some runs just > to get repeatability, but if you're benchmarking a 16TB machine I'm > guessing any serious benchmark that actually uses that
much memory is
> going to take many hours to a few days to run anyway? Having some way
> to wait until the memory is all done (which might even be just a silly
> shell script that does "ps" and waits for the kernel threads to all go
> away) isn't going to kill the benchmark - and the benchmark itself
> will then not have to worry about hitting the "oops, I need to
> initialize 2GB of RAM now because I hit an uninitialized page".
>

I am not overly concerned with the cost of having to set up a page struct
on first touch, but what I need to avoid is adding more permanent cost to
page faults on a system that is already "primed".

> Ok, so I don't know all the issues, and in many ways I don't even
> really care. You could do it other ways, I don't think this is a big
> deal. The part I hate is the runtime hook into the core MM page
> allocation code, so I'm just throwing out any random thing that comes
> to my mind that could be used to avoid that part.
>

The only mm structure we are adding to is a new flag in page->flags.
That didn't seem too much.

I had hoped to restrict the core mm changes to check_new_page and
free_pages_check but I haven't gotten there yet.

Not putting uninitialized pages onto the lru would work, but then I
would be concerned over any calculations based on totalpages. I might be
too paranoid there, but having that be incorrect until after a system is
booted worries me.
Nate From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758630Ab3HMXzY (ORCPT ); Tue, 13 Aug 2013 19:55:24 -0400 Received: from mail-vc0-f177.google.com ([209.85.220.177]:54638 "EHLO mail-vc0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756998Ab3HMXzX (ORCPT ); Tue, 13 Aug 2013 19:55:23 -0400 MIME-Version: 1.0 In-Reply-To: <20130813231020.GA22667@asylum.americas.sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <20130813231020.GA22667@asylum.americas.sgi.com> Date: Tue, 13 Aug 2013 16:55:21 -0700 X-Google-Sender-Auth: WPv1Wvv-5BTSGu9X12pvtElDycU Message-ID: Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator From: Linus Torvalds To: Nathan Zimmer Cc: Mike Travis , Peter Anvin , Ingo Molnar , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Aug 13, 2013 at 4:10 PM, Nathan Zimmer wrote: > > The only mm structure we are adding to is a new flag in page->flags. > That didn't seem too much. I don't agree. I see only downsides, and no upsides. Doing the same thing *without* the downsides seems straightforward, so I simply see no reason for any extra flags or tests at runtime. 
Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759722Ab3HNLGG (ORCPT ); Wed, 14 Aug 2013 07:06:06 -0400 Received: from mail-ea0-f169.google.com ([209.85.215.169]:60889 "EHLO mail-ea0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752168Ab3HNLGE (ORCPT ); Wed, 14 Aug 2013 07:06:04 -0400 Date: Wed, 14 Aug 2013 13:05:56 +0200 From: Ingo Molnar To: Linus Torvalds Cc: Mike Travis , Nathan Zimmer , Peter Anvin , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130814110556.GH10849@gmail.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Linus Torvalds wrote: > [...] > > Ok, so I don't know all the issues, and in many ways I don't even really > care. You could do it other ways, I don't think this is a big deal. The > part I hate is the runtime hook into the core MM page allocation code, > so I'm just throwing out any random thing that comes to my mind that > could be used to avoid that part. 
So, my hope was that it's possible to have a single, simple, zero-cost
runtime check [zero cost for already initialized pages], because it can be
merged into already existing page flag mask checks present here and
executed for every freshly allocated page:

static inline int check_new_page(struct page *page)
{
        if (unlikely(page_mapcount(page) |
                (page->mapping != NULL) |
                (atomic_read(&page->_count) != 0) |
                (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
                (mem_cgroup_bad_page_check(page)))) {
                bad_page(page);
                return 1;
        }
        return 0;
}

We already run this for every new page allocated and the initialization
check could hide in PAGE_FLAGS_CHECK_AT_PREP in a zero-cost fashion.

I'd not do any of the ensure_page_is_initialized() or
__expand_page_initialization() complications in this patch-set - each page
head represents itself and gets iterated when check_new_page() is done.

During regular bootup we'd initialize like before, except we don't set up
the page heads but memset() them to zero. With each page head 32 bytes
this would mean 8 GB of page head memory to clear per 1 TB - with 16 TB
that's 128 GB to clear - that ought to be possible to do rather quickly,
perhaps with some smart SMP cross-call approach that makes sure that each
memset is done in a node-local fashion. [*]

Such an approach should IMO be far smaller and less invasive than the
patches presented so far: it should be below 100 lines or so.

I don't know why there's such a big difference between the theory I
outlined and the invasive patch-set implemented so far in practice,
perhaps I'm missing some complication. I was trying to probe that
difference, before giving up on the idea and punting back to the async
hotplug-ish approach which would obviously work well too.

All in all, I think async init just hides the real problem - there's no
way memory init should take this long.

Thanks,

        Ingo

[*] alternatively maybe the main performance problem is that node-local
memory is set up on a remote (boot) node?
In that case I'd try to optimize it by migrating the memory init code's current node by using set_cpus_allowed() to live migrate from node to node, tracking the node whose struct page array is being initialized. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759922Ab3HNL1r (ORCPT ); Wed, 14 Aug 2013 07:27:47 -0400 Received: from mail-ea0-f173.google.com ([209.85.215.173]:44981 "EHLO mail-ea0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759901Ab3HNL1p (ORCPT ); Wed, 14 Aug 2013 07:27:45 -0400 Date: Wed, 14 Aug 2013 13:27:41 +0200 From: Ingo Molnar To: Linus Torvalds Cc: Nathan Zimmer , Mike Travis , Peter Anvin , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130814112741.GB13772@gmail.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <20130813231020.GA22667@asylum.americas.sgi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Linus Torvalds wrote: > On Tue, Aug 13, 2013 at 4:10 PM, Nathan Zimmer wrote: > > > > The only mm structure we are adding to is a new flag in page->flags. > > That didn't seem too much. > > I don't agree. > > I see only downsides, and no upsides. Doing the same thing *without* the > downsides seems straightforward, so I simply see no reason for any extra > flags or tests at runtime. The code as presented clearly looks more involved and neither simple nor zero-cost - I was hoping for a much more simple approach. 
I see three solutions:

 - Speed up the synchronous memory init code: live migrate to the node
   being set up via set_cpus_allowed(), to make sure the init is always
   fast and local.

   Pros: if it solves the problem then mem init is still synchronous,
   deterministic and essentially equivalent to what we do today - so
   relatively simple and well-tested, with no 'large machine' special
   path.

   Cons: it might not be enough and we might not have scheduling
   enabled on the affected nodes yet.

 - Speed up the synchronous memory init code by parallelizing the key,
   most expensive initialization portion of setting up the page head
   arrays to per node, via SMP function-calls.

   Pros: by far the fastest synchronous option. (It will also test the
   power budget and the mains fuses right during bootup.)

   Cons: more complex and depends on SMP cross-calls being available at
   mem init time. Not necessarily hotplug friendly.

 - Avoid the problem by punting to async mem init.

   Pros: it gets us to a minimal working system quickly and leaves the
   memory code relatively untouched.

   Cons: makes memory state asynchronous and non-deterministic. Stats
   either fluctuate shortly after bootup or have to be faked.
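
To make the third option more concrete, here is a rough user-space simulation of async mem init, under loudly stated assumptions: a small region is set up synchronously at "boot", and a background worker (modelled here as repeated calls to a step function rather than a real kernel thread) hands the rest to the allocator in batches. All demo_* names, the sizes, and the batch granularity are invented for this sketch and do not correspond to any kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NR_PAGES 4096u  /* pretend machine size, in pages */
#define DEMO_EARLY     512u  /* pages set up synchronously at "boot" */
#define DEMO_BATCH    1024u  /* pages handled per background step */

struct demo_page {
	unsigned long flags;
	int refcount;
};

struct demo_mem {
	struct demo_page pages[DEMO_NR_PAGES];
	size_t next_uninit;  /* first page head not yet set up */
	size_t nr_free;      /* pages the allocator may hand out */
};

/* Set up page heads [start, end) and make them visible to the allocator. */
void demo_init_range(struct demo_mem *m, size_t start, size_t end)
{
	size_t i;

	assert(start <= end && end <= DEMO_NR_PAGES);
	for (i = start; i < end; i++) {
		m->pages[i].flags = 0;
		m->pages[i].refcount = 0;
	}
	m->nr_free += end - start;
	m->next_uninit = end;
}

/* Synchronous part: only the first DEMO_EARLY pages are made available. */
void demo_boot(struct demo_mem *m)
{
	memset(m, 0, sizeof(*m));
	demo_init_range(m, 0, DEMO_EARLY);
}

/*
 * One unit of deferred work. Returns 1 while more work remains, 0 once
 * every page has been handed over. The allocator never sees a page
 * before this function (or demo_boot) has initialized it.
 */
int demo_deferred_step(struct demo_mem *m)
{
	size_t end;

	if (m->next_uninit >= DEMO_NR_PAGES)
		return 0;
	end = m->next_uninit + DEMO_BATCH;
	if (end > DEMO_NR_PAGES)
		end = DEMO_NR_PAGES;
	demo_init_range(m, m->next_uninit, end);
	return m->next_uninit < DEMO_NR_PAGES;
}
```

The nr_free counter growing between steps is the "stats fluctuate shortly after bootup" drawback listed above; the benefit is that nothing on the allocation path has to ask whether a page is initialized.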
Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933265Ab3HNWPM (ORCPT ); Wed, 14 Aug 2013 18:15:12 -0400 Received: from relay2.sgi.com ([192.48.179.30]:48710 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1760170Ab3HNWPI (ORCPT ); Wed, 14 Aug 2013 18:15:08 -0400 Date: Wed, 14 Aug 2013 17:15:06 -0500 From: Nathan Zimmer To: Ingo Molnar Cc: Linus Torvalds , Mike Travis , Nathan Zimmer , Peter Anvin , Linux Kernel Mailing List , linux-mm , Robin Holt , Rob Landley , Daniel J Blueman , Andrew Morton , Greg Kroah-Hartman , Yinghai Lu , Mel Gorman Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator Message-ID: <20130814221505.GA147490@asylum.americas.sgi.com> References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> <520A6DFC.1070201@sgi.com> <20130814110556.GH10849@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130814110556.GH10849@gmail.com> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 14, 2013 at 01:05:56PM +0200, Ingo Molnar wrote: > > * Linus Torvalds wrote: > > > [...] > > > > Ok, so I don't know all the issues, and in many ways I don't even really > > care. You could do it other ways, I don't think this is a big deal. The > > part I hate is the runtime hook into the core MM page allocation code, > > so I'm just throwing out any random thing that comes to my mind that > > could be used to avoid that part. 
>
> So, my hope was that it's possible to have a single, simple, zero-cost
> runtime check [zero cost for already initialized pages], because it can be
> merged into already existing page flag mask checks present here and
> executed for every freshly allocated page:
>
> static inline int check_new_page(struct page *page)
> {
>         if (unlikely(page_mapcount(page) |
>                 (page->mapping != NULL) |
>                 (atomic_read(&page->_count) != 0) |
>                 (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
>                 (mem_cgroup_bad_page_check(page)))) {
>                 bad_page(page);
>                 return 1;
>         }
>         return 0;
> }
>
> We already run this for every new page allocated and the initialization
> check could hide in PAGE_FLAGS_CHECK_AT_PREP in a zero-cost fashion.
>
> I'd not do any of the ensure_page_is_initialized() or
> __expand_page_initialization() complications in this patch-set - each page
> head represents itself and gets iterated when check_new_page() is done.
>
> During regular bootup we'd initialize like before, except we don't set up
> the page heads but memset() them to zero. With each page head 32 bytes
> this would mean 8 GB of page head memory to clear per 1 TB - with 16 TB
> that's 128 GB to clear - that ought to be possible to do rather quickly,
> perhaps with some smart SMP cross-call approach that makes sure that each
> memset is done in a node-local fashion. [*]
>
> Such an approach should IMO be far smaller and less invasive than the
> patches presented so far: it should be below 100 lines or so.
>
> I don't know why there's such a big difference between the theory I
> outlined and the invasive patch-set implemented so far in practice,
> perhaps I'm missing some complication. I was trying to probe that
> difference, before giving up on the idea and punting back to the async
> hotplug-ish approach which would obviously work well too.
> The reason, which I failed to mention, is that once we pull a page off the lru in either __rmqueue_fallback or __rmqueue_smallest, the first thing we do with it is expand() or sometimes move_freepages(). These then trip over some BUG_ON and VM_BUG_ON checks. Those BUG_ONs are what keep causing me to delve into the ensure/expand foolishness. Nate From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754314Ab3HPQgv (ORCPT ); Fri, 16 Aug 2013 12:36:51 -0400 Received: from mga02.intel.com ([134.134.136.20]:51671 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753598Ab3HPQgp (ORCPT ); Fri, 16 Aug 2013 12:36:45 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.89,895,1367996400"; d="scan'208";a="388287263" Message-ID: <520E5517.9070606@intel.com> Date: Fri, 16 Aug 2013 09:36:39 -0700 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130803 Thunderbird/17.0.8 MIME-Version: 1.0 To: Nathan Zimmer CC: hpa@zytor.com, mingo@kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, holt@sgi.com, rob@landley.net, travis@sgi.com, daniel@numascale-asia.com, akpm@linux-foundation.org, gregkh@linuxfoundation.org, yinghai@kernel.org, mgorman@suse.de Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator References: <1375465467-40488-1-git-send-email-nzimmer@sgi.com> <1376344480-156708-1-git-send-email-nzimmer@sgi.com> In-Reply-To: <1376344480-156708-1-git-send-email-nzimmer@sgi.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hey Nathan, Could you post your boot timing patches? My machines are much smaller than yours, but I'm curious how things behave here as well. I did some very imprecise timings (strace -t on a telnet attached to the serial console).
The 'struct page' initializations take about a minute of boot time for me to do 1TB across 8 NUMA nodes (this is a glueless QPI system[1]). By my _quick_ calculations, it is about 2x as fast to initialize node0's memory as it is to initialize the other nodes' memory, and boot time increases by about a second for every 30G of memory we add. So even with nothing else fancy, we could get some serious improvements from just doing the initialization locally.

[1] We call anything that uses pure QPI, without any other circuitry for the NUMA interconnects, "glueless".