From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wg0-f54.google.com (mail-wg0-f54.google.com [74.125.82.54]) by kanga.kvack.org (Postfix) with ESMTP id 5BB926B0038 for ; Thu, 23 Apr 2015 06:33:21 -0400 (EDT)
Received: by wgyo15 with SMTP id o15so13562309wgy.2 for ; Thu, 23 Apr 2015 03:33:20 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id g14si13085965wjz.39.2015.04.23.03.33.19 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:19 -0700 (PDT)
From: Mel Gorman
Subject: [PATCH 0/13] Parallel struct page initialisation v3
Date: Thu, 23 Apr 2015 11:33:03 +0100
Message-Id: <1429785196-7668-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linux-MM
Cc: Nathan Zimmer, Dave Hansen, Waiman Long, Scott Norton, Daniel J Blueman, Andrew Morton, LKML, Mel Gorman

The big change here is an adjustment to the topology_init path that caused
soft lockups for Waiman and that Daniel Blueman had reported as an expensive
function.

Changelog since v2
o Reduce overhead of topology_init
o Remove boot-time kernel parameter to enable/disable
o Enable on UMA

Changelog since v1
o Always initialise low zones
o Typo corrections
o Rename parallel mem init to parallel struct page init
o Rebase to 4.0

Struct page initialisation had been identified as one of the reasons why
large machines take a long time to boot. Patches were posted a long time ago
to defer initialisation of struct pages until they were first used. This was
rejected on the grounds it should not be necessary to hurt the fast paths.
This series reuses much of the work from that time but defers the
initialisation of memory to kswapd so that one thread per node initialises
memory local to that node.

After applying the series and setting the appropriate Kconfig variable I
see this in the boot log on a 64G machine

[    7.383764] kswapd 0 initialised deferred memory in 188ms
[    7.404253] kswapd 1 initialised deferred memory in 208ms
[    7.411044] kswapd 3 initialised deferred memory in 216ms
[    7.411551] kswapd 2 initialised deferred memory in 216ms

On a 1TB machine, I see

[    8.406511] kswapd 3 initialised deferred memory in 1116ms
[    8.428518] kswapd 1 initialised deferred memory in 1140ms
[    8.435977] kswapd 0 initialised deferred memory in 1148ms
[    8.437416] kswapd 2 initialised deferred memory in 1148ms

Once booted the machine appears to work as normal. Boot times were measured
from the time shutdown was called until ssh was available again. In the
64G case, the boot time savings are negligible. On the 1TB machine, the
savings were 16 seconds.

It would be nice if the people that have access to really large machines
would test this series and report how much boot time is reduced.

 arch/ia64/mm/numa.c      |  19 +--
 arch/x86/Kconfig         |   1 +
 drivers/base/node.c      |  11 +-
 include/linux/memblock.h |  18 +++
 include/linux/mm.h       |   8 +-
 include/linux/mmzone.h   |  23 ++-
 mm/Kconfig               |  18 +++
 mm/bootmem.c             |   8 +-
 mm/internal.h            |  23 ++-
 mm/memblock.c            |  34 ++++-
 mm/mm_init.c             |   9 +-
 mm/nobootmem.c           |   7 +-
 mm/page_alloc.c          | 379 ++++++++++++++++++++++++++++++++++++++++------
 mm/vmscan.c              |   6 +-
 14 files changed, 462 insertions(+), 102 deletions(-)

--
2.3.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wi0-f181.google.com (mail-wi0-f181.google.com [209.85.212.181]) by kanga.kvack.org (Postfix) with ESMTP id 9E8846B006E for ; Thu, 23 Apr 2015 06:33:22 -0400 (EDT)
Received: by widdi4 with SMTP id di4so210319982wid.0 for ; Thu, 23 Apr 2015 03:33:22 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id lc2si13038713wjb.150.2015.04.23.03.33.19 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:19 -0700 (PDT)
From: Mel Gorman
Subject: [PATCH 01/13] memblock: Introduce a for_each_reserved_mem_region iterator.
Date: Thu, 23 Apr 2015 11:33:04 +0100
Message-Id: <1429785196-7668-2-git-send-email-mgorman@suse.de>
In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linux-MM
Cc: Nathan Zimmer, Dave Hansen, Waiman Long, Scott Norton, Daniel J Blueman, Andrew Morton, LKML, Mel Gorman

From: Robin Holt

As part of initializing struct pages in 2MiB chunks, we noticed that at the
end of free_all_bootmem(), there was nothing which had forced the
reserved/allocated 4KiB pages to be initialized.

This helper function will be used for that expansion.

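The iterator pattern described above can be sketched in user space as
follows. This is only an illustration under stated assumptions, not the
kernel code: struct region, the reserved[] table, next_reserved_region(),
for_each_reserved_region() and total_reserved_bytes() are hypothetical
stand-ins for struct memblock_region, memblock.reserved,
__next_reserved_mem_region() and for_each_reserved_mem_region(). As in the
patch, the reported end address is inclusive and the index is set to the
maximum 64-bit value to signal the end of iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct memblock_region. */
struct region {
	uint64_t base;
	uint64_t size;
};

/* Hypothetical stand-in for memblock.reserved. */
static const struct region reserved[] = {
	{ 0x00000, 0x1000 },
	{ 0x08000, 0x2000 },
	{ 0x20000, 0x4000 },
};
static const uint64_t reserved_cnt = sizeof(reserved) / sizeof(reserved[0]);

/*
 * Report the region at *idx and advance the index, or set *idx to
 * UINT64_MAX once every region has been visited.
 */
static void next_reserved_region(uint64_t *idx, uint64_t *out_start,
				 uint64_t *out_end)
{
	assert(idx != NULL);

	if (*idx < reserved_cnt) {
		const struct region *r = &reserved[*idx];

		if (out_start)
			*out_start = r->base;
		if (out_end)
			*out_end = r->base + r->size - 1;	/* inclusive end */

		*idx += 1;
		return;
	}

	/* signal end of iteration */
	*idx = UINT64_MAX;
}

/* The loop body runs once per region; i is purely the loop cursor. */
#define for_each_reserved_region(i, p_start, p_end)			\
	for (i = 0, next_reserved_region(&i, p_start, p_end);		\
	     i != UINT64_MAX;						\
	     next_reserved_region(&i, p_start, p_end))

/* Example caller: add up the size of every reserved region. */
static uint64_t total_reserved_bytes(void)
{
	uint64_t i, start, end, total = 0;

	for_each_reserved_region(i, &start, &end)
		total += end - start + 1;

	return total;
}
```

Because the index is unsigned, a lower-bound check on it would always be
true, so the sketch only tests the upper bound.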
Signed-off-by: Robin Holt Signed-off-by: Nate Zimmer Signed-off-by: Mel Gorman --- include/linux/memblock.h | 18 ++++++++++++++++++ mm/memblock.c | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index e8cc45307f8f..3075e7673c54 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -93,6 +93,9 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a, struct memblock_type *type_b, phys_addr_t *out_start, phys_addr_t *out_end, int *out_nid); +void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start, + phys_addr_t *out_end); + /** * for_each_mem_range - iterate through memblock areas from type_a and not * included in type_b. Or just type_a if type_b is NULL. @@ -132,6 +135,21 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a, __next_mem_range_rev(&i, nid, type_a, type_b, \ p_start, p_end, p_nid)) +/** + * for_each_reserved_mem_region - iterate over all reserved memblock areas + * @i: u64 used as loop variable + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL + * + * Walks over reserved areas of memblock. Available as soon as memblock + * is initialized. 
+ */ +#define for_each_reserved_mem_region(i, p_start, p_end) \ + for (i = 0UL, \ + __next_reserved_mem_region(&i, p_start, p_end); \ + i != (u64)ULLONG_MAX; \ + __next_reserved_mem_region(&i, p_start, p_end)) + #ifdef CONFIG_MOVABLE_NODE static inline bool memblock_is_hotpluggable(struct memblock_region *m) { diff --git a/mm/memblock.c b/mm/memblock.c index 252b77bdf65e..e0cc2d174f74 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -765,6 +765,38 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size) } /** + * __next_reserved_mem_region - next function for for_each_reserved_region() + * @idx: pointer to u64 loop variable + * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL + * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL + * + * Iterate over all reserved memory regions. + */ +void __init_memblock __next_reserved_mem_region(u64 *idx, + phys_addr_t *out_start, + phys_addr_t *out_end) +{ + struct memblock_type *rsv = &memblock.reserved; + + if (*idx >= 0 && *idx < rsv->cnt) { + struct memblock_region *r = &rsv->regions[*idx]; + phys_addr_t base = r->base; + phys_addr_t size = r->size; + + if (out_start) + *out_start = base; + if (out_end) + *out_end = base + size - 1; + + *idx += 1; + return; + } + + /* signal end of iteration */ + *idx = ULLONG_MAX; +} + +/** * __next__mem_range - next function for for_each_free_mem_range() etc. * @idx: pointer to u64 loop variable * @nid: node selector, %NUMA_NO_NODE for all nodes -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wg0-f54.google.com (mail-wg0-f54.google.com [74.125.82.54]) by kanga.kvack.org (Postfix) with ESMTP id C8DEE6B0070 for ; Thu, 23 Apr 2015 06:33:24 -0400 (EDT)
Received: by wgen6 with SMTP id n6so13526922wge.3 for ; Thu, 23 Apr 2015 03:33:24 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id bz8si13028140wjc.178.2015.04.23.03.33.19 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:20 -0700 (PDT)
From: Mel Gorman
Subject: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.
Date: Thu, 23 Apr 2015 11:33:05 +0100
Message-Id: <1429785196-7668-3-git-send-email-mgorman@suse.de>
In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linux-MM
Cc: Nathan Zimmer, Dave Hansen, Waiman Long, Scott Norton, Daniel J Blueman, Andrew Morton, LKML, Mel Gorman

From: Robin Holt

Currently, memmap_init_zone() has all the smarts for initializing a
single page. A subset of this is required for parallel page initialisation
and so this patch breaks up the monolithic function in preparation.

Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer Signed-off-by: Mel Gorman --- mm/page_alloc.c | 79 +++++++++++++++++++++++++++++++++------------------------ 1 file changed, 46 insertions(+), 33 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 40e29429e7b0..fd7a6d09062d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -778,6 +778,51 @@ static int free_tail_pages_check(struct page *head_page, struct page *page) return 0; } +static void __meminit __init_single_page(struct page *page, unsigned long pfn, + unsigned long zone, int nid) +{ + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_cpupid_reset_last(page); + SetPageReserved(page); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. + */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + +static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone, + int nid) +{ + return __init_single_page(pfn_to_page(pfn), pfn, zone, nid); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { bool compound = PageCompound(page); @@ -4124,7 +4169,6 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { - struct page *page; unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; @@ -4145,38 +4189,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_cpupid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_pfn(pfn, zone, nid); } }
--
2.3.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wi0-f177.google.com (mail-wi0-f177.google.com [209.85.212.177]) by kanga.kvack.org (Postfix) with ESMTP id 38CC26B0071 for ; Thu, 23 Apr 2015 06:33:27 -0400 (EDT)
Received: by widdi4 with SMTP id di4so210322319wid.0 for ; Thu, 23 Apr 2015 03:33:26 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id k7si31302565wiw.92.2015.04.23.03.33.20 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:21 -0700 (PDT)
From: Mel Gorman
Subject: [PATCH 03/13] mm: meminit: Only set page reserved in the memblock region
Date: Thu, 23 Apr 2015 11:33:06 +0100
Message-Id: <1429785196-7668-4-git-send-email-mgorman@suse.de>
In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linux-MM
Cc: Nathan Zimmer, Dave Hansen, Waiman Long, Scott Norton, Daniel J Blueman, Andrew Morton, LKML, Mel Gorman

From: Nathan Zimmer

Currently each page struct is set as reserved upon initialization.
This patch changes that to start with the reserved bit clear and then
only set the bit for pages in the reserved regions.

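The change in behaviour can be sketched in user space as follows. This is
a simplified model under stated assumptions, not the kernel implementation:
the 16-page mem_map[], the one-field struct page, init_single_page(),
reserve_region() and demo_reserved_pages() are hypothetical, and PFN_DOWN,
PFN_UP and pfn_valid() are local simplified copies of the kernel helpers of
the same name. Pages start with the reserved flag clear, and the flag is
then set only for PFNs covered by a reserved range whose end address is
inclusive.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PFN_DOWN(x)	((x) >> PAGE_SHIFT)
#define PFN_UP(x)	(((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
#define NR_PAGES	16UL

/* Hypothetical one-flag model of struct page. */
struct page {
	bool reserved;
};

static struct page mem_map[NR_PAGES];

/* Simplified model: every PFN inside the toy mem_map is valid. */
static bool pfn_valid(unsigned long pfn)
{
	return pfn < NR_PAGES;
}

/* After the change, initialisation leaves the reserved flag clear. */
static void init_single_page(struct page *page)
{
	page->reserved = false;
}

/* Mark every valid page overlapping [start, end] as reserved. */
static void reserve_region(unsigned long start, unsigned long end)
{
	unsigned long pfn = PFN_DOWN(start);
	unsigned long end_pfn = PFN_UP(end);

	for (; pfn < end_pfn; pfn++)
		if (pfn_valid(pfn))
			mem_map[pfn].reserved = true;
}

/* Initialise all pages, reserve one range, count the reserved pages. */
static unsigned long demo_reserved_pages(void)
{
	unsigned long pfn, count = 0;

	for (pfn = 0; pfn < NR_PAGES; pfn++)
		init_single_page(&mem_map[pfn]);

	/* Physical range 0x2000..0x3fff covers PFNs 2 and 3. */
	reserve_region(0x2000, 0x3fff);

	for (pfn = 0; pfn < NR_PAGES; pfn++)
		if (mem_map[pfn].reserved)
			count++;

	return count;
}
```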
Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer --- include/linux/mm.h | 2 ++ mm/nobootmem.c | 3 +++ mm/page_alloc.c | 11 ++++++++++- 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 47a93928b90f..b6f82a31028a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1711,6 +1711,8 @@ extern void free_highmem_page(struct page *page); extern void adjust_managed_page_count(struct page *page, long count); extern void mem_init_print_info(const char *str); +extern void reserve_bootmem_region(unsigned long start, unsigned long end); + /* Free the reserved page into the buddy system, so it gets managed. */ static inline void __free_reserved_page(struct page *page) { diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 90b50468333e..396f9e450dc1 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -121,6 +121,9 @@ static unsigned long __init free_low_memory_core_early(void) memblock_clear_hotplug(0, -1); + for_each_reserved_mem_region(i, &start, &end) + reserve_bootmem_region(start, end); + for_each_free_mem_range(i, NUMA_NO_NODE, &start, &end, NULL) count += __free_memory_core(start, end); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fd7a6d09062d..2abb3b861e70 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -788,7 +788,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn, init_page_count(page); page_mapcount_reset(page); page_cpupid_reset_last(page); - SetPageReserved(page); /* * Mark the block movable so that blocks are reserved for @@ -823,6 +822,16 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone, return __init_single_page(pfn_to_page(pfn), pfn, zone, nid); } +void reserve_bootmem_region(unsigned long start, unsigned long end) +{ + unsigned long start_pfn = PFN_DOWN(start); + unsigned long end_pfn = PFN_UP(end); + + for (; start_pfn < end_pfn; start_pfn++) + if (pfn_valid(start_pfn)) + SetPageReserved(pfn_to_page(start_pfn)); +} 
+ static bool free_pages_prepare(struct page *page, unsigned int order) { bool compound = PageCompound(page);
--
2.3.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com [209.85.212.171]) by kanga.kvack.org (Postfix) with ESMTP id 9EB536B0072 for ; Thu, 23 Apr 2015 06:33:29 -0400 (EDT)
Received: by widdi4 with SMTP id di4so210323538wid.0 for ; Thu, 23 Apr 2015 03:33:29 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id dk3si13616281wib.13.2015.04.23.03.33.21 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:22 -0700 (PDT)
From: Mel Gorman
Subject: [PATCH 04/13] mm: page_alloc: Pass PFN to __free_pages_bootmem
Date: Thu, 23 Apr 2015 11:33:07 +0100
Message-Id: <1429785196-7668-5-git-send-email-mgorman@suse.de>
In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linux-MM
Cc: Nathan Zimmer, Dave Hansen, Waiman Long, Scott Norton, Daniel J Blueman, Andrew Morton, LKML, Mel Gorman

__free_pages_bootmem prepares a page for release to the buddy allocator
and assumes that the struct page is initialised. Parallel initialisation of
struct pages defers initialisation and __free_pages_bootmem can be called
for struct pages that cannot yet map struct page to PFN. This patch passes
PFN to __free_pages_bootmem with no other functional change.

Signed-off-by: Mel Gorman --- mm/bootmem.c | 8 ++++---- mm/internal.h | 3 ++- mm/memblock.c | 2 +- mm/nobootmem.c | 4 ++-- mm/page_alloc.c | 3 ++- 5 files changed, 11 insertions(+), 9 deletions(-) diff --git a/mm/bootmem.c b/mm/bootmem.c index 477be696511d..daf956bb4782 100644 --- a/mm/bootmem.c +++ b/mm/bootmem.c @@ -164,7 +164,7 @@ void __init free_bootmem_late(unsigned long physaddr, unsigned long size) end = PFN_DOWN(physaddr + size); for (; cursor < end; cursor++) { - __free_pages_bootmem(pfn_to_page(cursor), 0); + __free_pages_bootmem(pfn_to_page(cursor), cursor, 0); totalram_pages++; } } @@ -210,7 +210,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata) if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL) { int order = ilog2(BITS_PER_LONG); - __free_pages_bootmem(pfn_to_page(start), order); + __free_pages_bootmem(pfn_to_page(start), start, order); count += BITS_PER_LONG; start += BITS_PER_LONG; } else { @@ -220,7 +220,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata) while (vec && cur != start) { if (vec & 1) { page = pfn_to_page(cur); - __free_pages_bootmem(page, 0); + __free_pages_bootmem(page, cur, 0); count++; } vec >>= 1; @@ -234,7 +234,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata) pages = bootmem_bootmap_pages(pages); count += pages; while (pages--) - __free_pages_bootmem(page++, 0); + __free_pages_bootmem(page++, cur++, 0); bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count); diff --git a/mm/internal.h b/mm/internal.h index a96da5b0029d..76b605139c7a 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -155,7 +155,8 @@ __find_buddy_index(unsigned long page_idx, unsigned int order) } extern int __isolate_free_page(struct page *page, unsigned int order); -extern void __free_pages_bootmem(struct page *page, unsigned int order); +extern void __free_pages_bootmem(struct page *page, unsigned long pfn, + unsigned int order); extern void prep_compound_page(struct 
page *page, unsigned long order); #ifdef CONFIG_MEMORY_FAILURE extern bool is_free_buddy_page(struct page *page); diff --git a/mm/memblock.c b/mm/memblock.c index e0cc2d174f74..f3e97d8eeb5c 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1334,7 +1334,7 @@ void __init __memblock_free_late(phys_addr_t base, phys_addr_t size) end = PFN_DOWN(base + size); for (; cursor < end; cursor++) { - __free_pages_bootmem(pfn_to_page(cursor), 0); + __free_pages_bootmem(pfn_to_page(cursor), cursor, 0); totalram_pages++; } } diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 396f9e450dc1..bae652713ee5 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -77,7 +77,7 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) end = PFN_DOWN(addr + size); for (; cursor < end; cursor++) { - __free_pages_bootmem(pfn_to_page(cursor), 0); + __free_pages_bootmem(pfn_to_page(cursor), cursor, 0); totalram_pages++; } } @@ -92,7 +92,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end) while (start + (1UL << order) > end) order--; - __free_pages_bootmem(pfn_to_page(start), order); + __free_pages_bootmem(pfn_to_page(start), start, order); start += (1UL << order); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2abb3b861e70..0a0e0f280d87 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -886,7 +886,8 @@ static void __free_pages_ok(struct page *page, unsigned int order) local_irq_restore(flags); } -void __init __free_pages_bootmem(struct page *page, unsigned int order) +void __init __free_pages_bootmem(struct page *page, unsigned long pfn, + unsigned int order) { unsigned int nr_pages = 1 << order; struct page *p = page; -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wg0-f47.google.com (mail-wg0-f47.google.com [74.125.82.47]) by kanga.kvack.org (Postfix) with ESMTP id 29B136B0073 for ; Thu, 23 Apr 2015 06:33:32 -0400 (EDT)
Received: by wgen6 with SMTP id n6so13530027wge.3 for ; Thu, 23 Apr 2015 03:33:31 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id fb2si13614194wib.18.2015.04.23.03.33.22 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:23 -0700 (PDT)
From: Mel Gorman
Subject: [PATCH 05/13] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid
Date: Thu, 23 Apr 2015 11:33:08 +0100
Message-Id: <1429785196-7668-6-git-send-email-mgorman@suse.de>
In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linux-MM
Cc: Nathan Zimmer, Dave Hansen, Waiman Long, Scott Norton, Daniel J Blueman, Andrew Morton, LKML, Mel Gorman

The generic and arch-specific implementations of __early_pfn_to_nid() use
static variables to cache recent lookups. Without the cache, boot times are
much higher due to the excessive memblock lookups, but the cache assumes
that memory initialisation is single-threaded. Parallel initialisation of
struct pages will break that assumption so this patch makes
__early_pfn_to_nid() SMP-safe by requiring the caller to cache recent search
information. early_pfn_to_nid() keeps the same interface but is only safe
to use early in boot due to the use of a global static variable.
meminit_pfn_in_nid() is an SMP-safe version for which callers must maintain
their own state.

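The idea of moving the lookup cache out of function-local statics and into
caller-owned state can be sketched in user space as follows. This is a
minimal illustration under stated assumptions, not the kernel code: the
ranges[] table, struct pfnnid_cache (including its misses counter, which
exists only to make the caching visible), pfn_to_nid_cached() and
pfn_in_nid() are hypothetical stand-ins for the memblock search,
struct mminit_pfnnid_cache, __early_pfn_to_nid() and meminit_pfn_in_nid().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical PFN range to node table standing in for memblock. */
struct node_range {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
	int nid;
};

static const struct node_range ranges[] = {
	{    0, 1000, 0 },
	{ 1000, 2000, 1 },
	{ 2000, 3000, 0 },
};

/* Caller-owned cache; a zero-initialised cache never produces a hit. */
struct pfnnid_cache {
	unsigned long last_start;
	unsigned long last_end;
	int last_nid;
	unsigned long misses;	/* demo only: counts full table searches */
};

/* Return the node for pfn, or -1 if no range contains it. */
static int pfn_to_nid_cached(unsigned long pfn, struct pfnnid_cache *state)
{
	size_t i;

	/* Fast path: same range as the previous lookup by this caller. */
	if (state->last_start <= pfn && pfn < state->last_end)
		return state->last_nid;

	state->misses++;
	for (i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
		const struct node_range *r = &ranges[i];

		if (r->start_pfn <= pfn && pfn < r->end_pfn) {
			state->last_start = r->start_pfn;
			state->last_end = r->end_pfn;
			state->last_nid = r->nid;
			return r->nid;
		}
	}
	return -1;
}

/* True unless the pfn is known to belong to a different node. */
static bool pfn_in_nid(unsigned long pfn, int node, struct pfnnid_cache *state)
{
	int nid = pfn_to_nid_cached(pfn, state);

	return !(nid >= 0 && nid != node);
}
```

Each initialisation thread would keep its own struct pfnnid_cache, for
example on its stack, so no cache state is shared between threads and no
locking is needed around the fast path.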
Signed-off-by: Mel Gorman --- arch/ia64/mm/numa.c | 19 +++++++------------ include/linux/mm.h | 6 ++++-- include/linux/mmzone.h | 16 +++++++++++++++- mm/page_alloc.c | 40 +++++++++++++++++++++++++--------------- 4 files changed, 51 insertions(+), 30 deletions(-) diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c index ea21d4cad540..aa19b7ac8222 100644 --- a/arch/ia64/mm/numa.c +++ b/arch/ia64/mm/numa.c @@ -58,27 +58,22 @@ paddr_to_nid(unsigned long paddr) * SPARSEMEM to allocate the SPARSEMEM sectionmap on the NUMA node where * the section resides. */ -int __meminit __early_pfn_to_nid(unsigned long pfn) +int __meminit __early_pfn_to_nid(unsigned long pfn, + struct mminit_pfnnid_cache *state) { int i, section = pfn >> PFN_SECTION_SHIFT, ssec, esec; - /* - * NOTE: The following SMP-unsafe globals are only used early in boot - * when the kernel is running single-threaded. - */ - static int __meminitdata last_ssec, last_esec; - static int __meminitdata last_nid; - if (section >= last_ssec && section < last_esec) - return last_nid; + if (section >= state->last_start && section < state->last_end) + return state->last_nid; for (i = 0; i < num_node_memblks; i++) { ssec = node_memblk[i].start_paddr >> PA_SECTION_SHIFT; esec = (node_memblk[i].start_paddr + node_memblk[i].size + ((1L << PA_SECTION_SHIFT) - 1)) >> PA_SECTION_SHIFT; if (section >= ssec && section < esec) { - last_ssec = ssec; - last_esec = esec; - last_nid = node_memblk[i].nid; + state->last_start = ssec; + state->last_end = esec; + state->last_nid = node_memblk[i].nid; return node_memblk[i].nid; } } diff --git a/include/linux/mm.h b/include/linux/mm.h index b6f82a31028a..a8a8b161fd65 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1802,7 +1802,8 @@ extern void sparse_memory_present_with_active_regions(int nid); #if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \ !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) -static inline int __early_pfn_to_nid(unsigned long pfn) +static inline int 
__early_pfn_to_nid(unsigned long pfn, + struct mminit_pfnnid_cache *state) { return 0; } @@ -1810,7 +1811,8 @@ static inline int __early_pfn_to_nid(unsigned long pfn) /* please see mm/page_alloc.c */ extern int __meminit early_pfn_to_nid(unsigned long pfn); /* there is a per-arch backend function. */ -extern int __meminit __early_pfn_to_nid(unsigned long pfn); +extern int __meminit __early_pfn_to_nid(unsigned long pfn, + struct mminit_pfnnid_cache *state); #endif extern void set_dma_reserve(unsigned long new_dma_reserve); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 2782df47101e..a67b33e52dfe 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1216,10 +1216,24 @@ void sparse_init(void); #define sparse_index_init(_sec, _nid) do {} while (0) #endif /* CONFIG_SPARSEMEM */ +/* + * During memory init memblocks map pfns to nids. The search is expensive and + * this caches recent lookups. The implementation of __early_pfn_to_nid + * may treat start/end as pfns or sections. + */ +struct mminit_pfnnid_cache { + unsigned long last_start; + unsigned long last_end; + int last_nid; +}; + #ifdef CONFIG_NODES_SPAN_OTHER_NODES bool early_pfn_in_nid(unsigned long pfn, int nid); +bool meminit_pfn_in_nid(unsigned long pfn, int node, + struct mminit_pfnnid_cache *state); #else -#define early_pfn_in_nid(pfn, nid) (1) +#define early_pfn_in_nid(pfn, nid) (1) +#define meminit_pfn_in_nid(pfn, nid, state) (1) #endif #ifndef early_pfn_valid diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0a0e0f280d87..f556ed63b964 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4457,39 +4457,41 @@ int __meminit init_currently_empty_zone(struct zone *zone, #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID + /* * Required by SPARSEMEM. Given a PFN, return what node the PFN is on. 
*/ -int __meminit __early_pfn_to_nid(unsigned long pfn) +int __meminit __early_pfn_to_nid(unsigned long pfn, + struct mminit_pfnnid_cache *state) { unsigned long start_pfn, end_pfn; int nid; - /* - * NOTE: The following SMP-unsafe globals are only used early in boot - * when the kernel is running single-threaded. - */ - static unsigned long __meminitdata last_start_pfn, last_end_pfn; - static int __meminitdata last_nid; - if (last_start_pfn <= pfn && pfn < last_end_pfn) - return last_nid; + if (state->last_start <= pfn && pfn < state->last_end) + return state->last_nid; nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn); if (nid != -1) { - last_start_pfn = start_pfn; - last_end_pfn = end_pfn; - last_nid = nid; + state->last_start = start_pfn; + state->last_end = end_pfn; + state->last_nid = nid; } return nid; } #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */ +struct __meminitdata mminit_pfnnid_cache global_init_state; + +/* Only safe to use early in boot when initialisation is single-threaded */ int __meminit early_pfn_to_nid(unsigned long pfn) { int nid; - nid = __early_pfn_to_nid(pfn); + /* The system will behave unpredictably otherwise */ + BUG_ON(system_state != SYSTEM_BOOTING); + + nid = __early_pfn_to_nid(pfn, &global_init_state); if (nid >= 0) return nid; /* just returns 0 */ @@ -4497,15 +4499,23 @@ int __meminit early_pfn_to_nid(unsigned long pfn) } #ifdef CONFIG_NODES_SPAN_OTHER_NODES -bool __meminit early_pfn_in_nid(unsigned long pfn, int node) +bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node, + struct mminit_pfnnid_cache *state) { int nid; - nid = __early_pfn_to_nid(pfn); + nid = __early_pfn_to_nid(pfn, state); if (nid >= 0 && nid != node) return false; return true; } + +/* Only safe to use early in boot when initialisation is single-threaded */ +bool __meminit early_pfn_in_nid(unsigned long pfn, int node) +{ + return meminit_pfn_in_nid(pfn, node, &global_init_state); +} + #endif /** -- 2.3.5 -- To unsubscribe, send a message with 
'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wg0-f41.google.com (mail-wg0-f41.google.com [74.125.82.41]) by kanga.kvack.org (Postfix) with ESMTP id 8BCEC6B0032 for ; Thu, 23 Apr 2015 06:33:34 -0400 (EDT)
Received: by wgyo15 with SMTP id o15so13567892wgy.2 for ; Thu, 23 Apr 2015 03:33:34 -0700 (PDT)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id p3si31313393wia.63.2015.04.23.03.33.23 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:23 -0700 (PDT)
From: Mel Gorman
Subject: [PATCH 06/13] mm: meminit: Inline some helper functions
Date: Thu, 23 Apr 2015 11:33:09 +0100
Message-Id: <1429785196-7668-7-git-send-email-mgorman@suse.de>
In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linux-MM
Cc: Nathan Zimmer, Dave Hansen, Waiman Long, Scott Norton, Daniel J Blueman, Andrew Morton, LKML, Mel Gorman

early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are
unnecessarily visible outside memory initialisation. As well as unnecessary
visibility, it's unnecessary function call overhead when initialising pages.
This patch moves the helpers inline.

Signed-off-by: Mel Gorman --- include/linux/mmzone.h | 9 ------ mm/page_alloc.c | 75 +++++++++++++++++++++++++------------------------- 2 files changed, 38 insertions(+), 46 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a67b33e52dfe..e3d8a2bd8d78 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1227,15 +1227,6 @@ struct mminit_pfnnid_cache { int last_nid; }; -#ifdef CONFIG_NODES_SPAN_OTHER_NODES -bool early_pfn_in_nid(unsigned long pfn, int nid); -bool meminit_pfn_in_nid(unsigned long pfn, int node, - struct mminit_pfnnid_cache *state); -#else -#define early_pfn_in_nid(pfn, nid) (1) -#define meminit_pfn_in_nid(pfn, nid, state) (1) -#endif - #ifndef early_pfn_valid #define early_pfn_valid(pfn) (1) #endif diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f556ed63b964..8b4659aa0bc2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -907,6 +907,44 @@ void __init __free_pages_bootmem(struct page *page, unsigned long pfn, __free_pages(page, order); } +#if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \ + defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) +/* Only safe to use early in boot when initialisation is single-threaded */ +struct __meminitdata mminit_pfnnid_cache global_init_state; +int __meminit early_pfn_to_nid(unsigned long pfn) +{ + int nid; + + /* The system will behave unpredictably otherwise */ + BUG_ON(system_state != SYSTEM_BOOTING); + + nid = __early_pfn_to_nid(pfn, &global_init_state); + if (nid >= 0) + return nid; + /* just returns 0 */ + return 0; +} +#endif + +#ifdef CONFIG_NODES_SPAN_OTHER_NODES +static inline bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node, + struct mminit_pfnnid_cache *state) +{ + int nid; + + nid = __early_pfn_to_nid(pfn, state); + if (nid >= 0 && nid != node) + return false; + return true; +} + +/* Only safe to use early in boot when initialisation is single-threaded */ +static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node) +{ + return 
meminit_pfn_in_nid(pfn, node, &global_init_state); +} +#endif + #ifdef CONFIG_CMA /* Free whole pageblock and set its migration type to MIGRATE_CMA. */ void __init init_cma_reserved_pageblock(struct page *page) @@ -4481,43 +4519,6 @@ int __meminit __early_pfn_to_nid(unsigned long pfn, } #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */ -struct __meminitdata mminit_pfnnid_cache global_init_state; - -/* Only safe to use early in boot when initialisation is single-threaded */ -int __meminit early_pfn_to_nid(unsigned long pfn) -{ - int nid; - - /* The system will behave unpredictably otherwise */ - BUG_ON(system_state != SYSTEM_BOOTING); - - nid = __early_pfn_to_nid(pfn, &global_init_state); - if (nid >= 0) - return nid; - /* just returns 0 */ - return 0; -} - -#ifdef CONFIG_NODES_SPAN_OTHER_NODES -bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node, - struct mminit_pfnnid_cache *state) -{ - int nid; - - nid = __early_pfn_to_nid(pfn, state); - if (nid >= 0 && nid != node) - return false; - return true; -} - -/* Only safe to use early in boot when initialisation is single-threaded */ -bool __meminit early_pfn_in_nid(unsigned long pfn, int node) -{ - return meminit_pfn_in_nid(pfn, node, &global_init_state); -} - -#endif - /** * free_bootmem_with_active_regions - Call memblock_free_early_nid for each active range * @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed. -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com [74.125.82.52]) by kanga.kvack.org (Postfix) with ESMTP id 2A20A6B006C for ; Thu, 23 Apr 2015 06:33:37 -0400 (EDT) Received: by wgyo15 with SMTP id o15so13569035wgy.2 for ; Thu, 23 Apr 2015 03:33:36 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15]) by mx.google.com with ESMTPS id 13si13033945wjt.165.2015.04.23.03.33.24 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:24 -0700 (PDT) From: Mel Gorman Subject: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set Date: Thu, 23 Apr 2015 11:33:10 +0100 Message-Id: <1429785196-7668-8-git-send-email-mgorman@suse.de> In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman This patch initialises all low memory struct pages and 2G of the highest zone on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set yet but will be available in a later patch. Parallel initialisation of struct page depends on some features from memory hotplug and it is necessary to alter section annotations.
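The 2G cut-off described above can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the patch itself: struct node_sketch, pages_in_2g(), should_keep_initialising() and init_zone_early() are hypothetical names, and the 4K page size and 128MB section size are assumed values. It mirrors the idea that low zones are always populated, while the highest zone stops at the first section-aligned PFN once more than 2G worth of pages has been initialised.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration only: 4K pages and 128MB sections. */
#define PAGE_SHIFT        12
#define SECTION_SIZE_BITS 27
#define PAGES_PER_SECTION (1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))
#define NO_DEFERRED_PFN   (~0UL)

/* Hypothetical stand-in for the per-node data used by the patch. */
struct node_sketch {
	unsigned long end_pfn;            /* one past the last PFN on the node */
	unsigned long first_deferred_pfn; /* NO_DEFERRED_PFN if nothing is deferred */
};

/* Number of pages that make up 2G: 2 * 2^30 bytes / 2^PAGE_SHIFT bytes. */
static unsigned long pages_in_2g(void)
{
	return 2UL << (30 - PAGE_SHIFT);
}

/*
 * Returns false once the rest of the zone should be left for later.
 * Zones that end below the node end are treated as low zones and are
 * always fully initialised.
 */
static bool should_keep_initialising(struct node_sketch *node,
				     unsigned long pfn, unsigned long zone_end,
				     unsigned long *nr_initialised)
{
	if (zone_end < node->end_pfn)
		return true;

	(*nr_initialised)++;
	if (*nr_initialised > pages_in_2g() &&
	    (pfn & (PAGES_PER_SECTION - 1)) == 0) {
		/* Stop on a section boundary and remember where to resume. */
		node->first_deferred_pfn = pfn;
		return false;
	}
	return true;
}

/* Walks [start_pfn, end_pfn) and returns how many pages were set up early. */
static unsigned long init_zone_early(struct node_sketch *node,
				     unsigned long start_pfn,
				     unsigned long end_pfn)
{
	unsigned long nr_initialised = 0, done = 0, pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!should_keep_initialising(node, pfn, end_pfn,
					      &nr_initialised))
			break;
		done++;
	}
	return done;
}
```

With these assumed sizes, a zone that starts on a section boundary stops after exactly 2G worth of pages; a zone that starts off a boundary runs on to the next section boundary before deferring.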
Signed-off-by: Mel Gorman --- drivers/base/node.c | 11 +++++-- include/linux/mmzone.h | 8 ++++++ mm/Kconfig | 18 ++++++++++++ mm/internal.h | 8 ++++++ mm/page_alloc.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++-- 5 files changed, 117 insertions(+), 6 deletions(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index 36fabe43cd44..d03e976b4431 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -361,12 +361,16 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid) #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE #define page_initialized(page) (page->lru.next) -static int get_nid_for_pfn(unsigned long pfn) +static int get_nid_for_pfn(struct pglist_data *pgdat, unsigned long pfn) { struct page *page; if (!pfn_valid_within(pfn)) return -1; +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT + if (pgdat && pfn >= pgdat->first_deferred_pfn) + return early_pfn_to_nid(pfn); +#endif page = pfn_to_page(pfn); if (!page_initialized(page)) return -1; @@ -378,6 +382,7 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) { int ret; unsigned long pfn, sect_start_pfn, sect_end_pfn; + struct pglist_data *pgdat = NODE_DATA(nid); if (!mem_blk) return -EFAULT; @@ -390,7 +395,7 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) { int page_nid; - page_nid = get_nid_for_pfn(pfn); + page_nid = get_nid_for_pfn(pgdat, pfn); if (page_nid < 0) continue; if (page_nid != nid) @@ -429,7 +434,7 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) { int nid; - nid = get_nid_for_pfn(pfn); + nid = get_nid_for_pfn(NULL, pfn); if (nid < 0) continue; if (!node_online(nid)) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e3d8a2bd8d78..4882c53b70b5 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -762,6 +762,14 @@ typedef struct pglist_data { /* Number of pages migrated during the 
rate limiting time interval */ unsigned long numabalancing_migrate_nr_pages; #endif + +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT + /* + * If memory initialisation on large machines is deferred then this + * is the first PFN that needs to be initialised. + */ + unsigned long first_deferred_pfn; +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ } pg_data_t; #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) diff --git a/mm/Kconfig b/mm/Kconfig index a03131b6ba8e..3e40cb64e226 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -629,3 +629,21 @@ config MAX_STACK_SIZE_MB changed to a smaller value in which case that is used. A sane initial value is 80 MB. + +# For architectures that support deferred memory initialisation +config ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT + bool + +config DEFERRED_STRUCT_PAGE_INIT + bool "Defer initialisation of struct pages to kswapd" + default n + depends on ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT + depends on MEMORY_HOTPLUG + help + Ordinarily all struct pages are initialised during early boot in a + single thread. On very large machines this can take a considerable + amount of time. If this option is set, large machines will bring up + a subset of memmap at boot and then initialise the rest in parallel + when kswapd starts. This has a potential performance impact on + processes running early in the lifetime of the system until kswapd + finishes the initialisation.
diff --git a/mm/internal.h b/mm/internal.h index 76b605139c7a..4a73f74846bd 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -385,6 +385,14 @@ static inline void mminit_verify_zonelist(void) } #endif /* CONFIG_DEBUG_MEMORY_INIT */ +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +#define __defermem_init __meminit +#define __defer_init __meminit +#else +#define __defermem_init +#define __defer_init __init +#endif + /* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */ #if defined(CONFIG_SPARSEMEM) extern void mminit_validate_memmodel_limits(unsigned long *start_pfn, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8b4659aa0bc2..c7c2d20c8bb5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -235,6 +235,64 @@ EXPORT_SYMBOL(nr_online_nodes); int page_group_by_mobility_disabled __read_mostly; +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +static inline void reset_deferred_meminit(pg_data_t *pgdat) +{ + pgdat->first_deferred_pfn = ULONG_MAX; +} + +/* Returns true if the struct page for the pfn is uninitialised */ +static inline bool __defermem_init early_page_uninitialised(unsigned long pfn) +{ + int nid = early_pfn_to_nid(pfn); + + if (pfn >= NODE_DATA(nid)->first_deferred_pfn) + return true; + + return false; +} + +/* + * Returns false when the remaining initialisation should be deferred until + * later in the boot cycle when it can be parallelised. 
+ */ +static inline bool update_defer_init(pg_data_t *pgdat, + unsigned long pfn, unsigned long zone_end, + unsigned long *nr_initialised) +{ + /* Always populate low zones for address-constrained allocations */ + if (zone_end < pgdat_end_pfn(pgdat)) + return true; + + /* Initialise at least 2G of the highest zone */ + (*nr_initialised)++; + if (*nr_initialised > (2UL << (30 - PAGE_SHIFT)) && + (pfn & (PAGES_PER_SECTION - 1)) == 0) { + pgdat->first_deferred_pfn = pfn; + return false; + } + + return true; +} +#else +static inline void reset_deferred_meminit(pg_data_t *pgdat) +{ +} + +static inline bool early_page_uninitialised(unsigned long pfn) +{ + return false; +} + +static inline bool update_defer_init(pg_data_t *pgdat, + unsigned long pfn, unsigned long zone_end, + unsigned long *nr_initialised) +{ + return true; +} +#endif + + void set_pageblock_migratetype(struct page *page, int migratetype) { if (unlikely(page_group_by_mobility_disabled && @@ -886,8 +944,8 @@ static void __free_pages_ok(struct page *page, unsigned int order) local_irq_restore(flags); } -void __init __free_pages_bootmem(struct page *page, unsigned long pfn, - unsigned int order) +static void __defer_init __free_pages_boot_core(struct page *page, + unsigned long pfn, unsigned int order) { unsigned int nr_pages = 1 << order; struct page *p = page; @@ -945,6 +1003,14 @@ static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node) } #endif +void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn, + unsigned int order) +{ + if (early_page_uninitialised(pfn)) + return; + return __free_pages_boot_core(page, pfn, order); +} + #ifdef CONFIG_CMA /* Free whole pageblock and set its migration type to MIGRATE_CMA.
*/ void __init init_cma_reserved_pageblock(struct page *page) @@ -4217,14 +4283,16 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { + pg_data_t *pgdat = NODE_DATA(nid); unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; + unsigned long nr_initialised = 0; if (highest_memmap_pfn < end_pfn - 1) highest_memmap_pfn = end_pfn - 1; - z = &NODE_DATA(nid)->node_zones[zone]; + z = &pgdat->node_zones[zone]; for (pfn = start_pfn; pfn < end_pfn; pfn++) { /* * There can be holes in boot-time mem_map[]s @@ -4236,6 +4304,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, continue; if (!early_pfn_in_nid(pfn, nid)) continue; + if (!update_defer_init(pgdat, pfn, end_pfn, + &nr_initialised)) + break; } __init_single_pfn(pfn, zone, nid); } @@ -5037,6 +5108,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size, /* pg_data_t should be reset to zero when it's allocated */ WARN_ON(pgdat->nr_zones || pgdat->classzone_idx); + reset_deferred_meminit(pgdat); pgdat->node_id = nid; pgdat->node_start_pfn = node_start_pfn; #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f44.google.com (mail-wg0-f44.google.com [74.125.82.44]) by kanga.kvack.org (Postfix) with ESMTP id A67FE6B0074 for ; Thu, 23 Apr 2015 06:33:39 -0400 (EDT) Received: by wgen6 with SMTP id n6so13533249wge.3 for ; Thu, 23 Apr 2015 03:33:39 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15]) by mx.google.com with ESMTPS id tb3si13543976wic.122.2015.04.23.03.33.25 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:25 -0700 (PDT) From: Mel Gorman Subject: [PATCH 08/13] mm: meminit: Initialise remaining struct pages in parallel with kswapd Date: Thu, 23 Apr 2015 11:33:11 +0100 Message-Id: <1429785196-7668-9-git-send-email-mgorman@suse.de> In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Only a subset of struct pages are initialised at the moment. When this patch is applied kswapd initialises the remaining struct pages in parallel. This should boot faster by spreading the work to multiple CPUs and initialising data that is local to the CPU. The user-visible effect on large machines is that free memory will appear to rapidly increase early in the lifetime of the system until kswapd reports that all memory is initialised in the kernel log. Once initialised there should be no other user-visible effects.
Signed-off-by: Mel Gorman --- mm/internal.h | 6 +++ mm/mm_init.c | 1 + mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-- mm/vmscan.c | 6 ++- 4 files changed, 123 insertions(+), 6 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 4a73f74846bd..2c4057140bec 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -388,9 +388,15 @@ static inline void mminit_verify_zonelist(void) #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT #define __defermem_init __meminit #define __defer_init __meminit + +void deferred_init_memmap(int nid); #else #define __defermem_init #define __defer_init __init + +static inline void deferred_init_memmap(int nid) +{ +} #endif /* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */ diff --git a/mm/mm_init.c b/mm/mm_init.c index 5f420f7fafa1..28fbf87b20aa 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -11,6 +11,7 @@ #include #include #include +#include #include "internal.h" #ifdef CONFIG_DEBUG_MEMORY_INIT diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c7c2d20c8bb5..f2db3d7aa6cb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -252,6 +252,14 @@ static inline bool __defermem_init early_page_uninitialised(unsigned long pfn) return false; } +static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid) +{ + if (pfn >= NODE_DATA(nid)->first_deferred_pfn) + return true; + + return false; +} + /* * Returns false when the remaining initialisation should be deferred until * later in the boot cycle when it can be parallelised. 
@@ -284,6 +292,11 @@ static inline bool early_page_uninitialised(unsigned long pfn) return false; } +static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid) +{ + return false; +} + static inline bool update_defer_init(pg_data_t *pgdat, unsigned long pfn, unsigned long zone_end, unsigned long *nr_initialised) @@ -880,14 +893,45 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone, return __init_single_page(pfn_to_page(pfn), pfn, zone, nid); } -void reserve_bootmem_region(unsigned long start, unsigned long end) +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +static void init_reserved_page(unsigned long pfn) +{ + pg_data_t *pgdat; + int nid, zid; + + if (!early_page_uninitialised(pfn)) + return; + + nid = early_pfn_to_nid(pfn); + pgdat = NODE_DATA(nid); + + for (zid = 0; zid < MAX_NR_ZONES; zid++) { + struct zone *zone = &pgdat->node_zones[zid]; + + if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone)) + break; + } + __init_single_pfn(pfn, zid, nid); +} +#else +static inline void init_reserved_page(unsigned long pfn) +{ +} +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ + +void __meminit reserve_bootmem_region(unsigned long start, unsigned long end) { unsigned long start_pfn = PFN_DOWN(start); unsigned long end_pfn = PFN_UP(end); - for (; start_pfn < end_pfn; start_pfn++) - if (pfn_valid(start_pfn)) - SetPageReserved(pfn_to_page(start_pfn)); + for (; start_pfn < end_pfn; start_pfn++) { + if (pfn_valid(start_pfn)) { + struct page *page = pfn_to_page(start_pfn); + + init_reserved_page(start_pfn); + SetPageReserved(page); + } + } } static bool free_pages_prepare(struct page *page, unsigned int order) @@ -1011,6 +1055,67 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn, return __free_pages_boot_core(page, pfn, order); } +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +/* Initialise remaining memory on a node */ +void __defermem_init deferred_init_memmap(int nid) +{ + unsigned long start = jiffies; + 
struct mminit_pfnnid_cache nid_init_state = { }; + + pg_data_t *pgdat = NODE_DATA(nid); + int zid; + unsigned long first_init_pfn = pgdat->first_deferred_pfn; + + if (first_init_pfn == ULONG_MAX) + return; + + /* Sanity check boundaries */ + BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn); + BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat)); + pgdat->first_deferred_pfn = ULONG_MAX; + + for (zid = 0; zid < MAX_NR_ZONES; zid++) { + struct zone *zone = pgdat->node_zones + zid; + unsigned long walk_start, walk_end; + int i; + + for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) { + unsigned long pfn, end_pfn; + + end_pfn = min(walk_end, zone_end_pfn(zone)); + pfn = first_init_pfn; + if (pfn < walk_start) + pfn = walk_start; + if (pfn < zone->zone_start_pfn) + pfn = zone->zone_start_pfn; + + for (; pfn < end_pfn; pfn++) { + struct page *page; + + if (!pfn_valid(pfn)) + continue; + + if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) + continue; + + if (page->flags) { + VM_BUG_ON(page_zone(page) != zone); + continue; + } + + __init_single_page(page, pfn, zid, nid); + __free_pages_boot_core(page, pfn, 0); + cond_resched(); + } + first_init_pfn = max(end_pfn, first_init_pfn); + } + } + + pr_info("kswapd %d initialised deferred memory in %ums\n", nid, + jiffies_to_msecs(jiffies - start)); +} +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ + #ifdef CONFIG_CMA /* Free whole pageblock and set its migration type to MIGRATE_CMA. 
*/ void __init init_cma_reserved_pageblock(struct page *page) @@ -4221,6 +4326,9 @@ static void setup_zone_migrate_reserve(struct zone *zone) zone->nr_migrate_reserve_block = reserve; for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) { + if (!early_page_nid_uninitialised(pfn, zone_to_nid(zone))) + return; + if (!pfn_valid(pfn)) continue; page = pfn_to_page(pfn); diff --git a/mm/vmscan.c b/mm/vmscan.c index 5e8eadd71bac..c4895d26d036 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3348,7 +3348,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx) * If there are applications that are active memory-allocators * (most normal use), this basically shouldn't matter. */ -static int kswapd(void *p) +static int __defermem_init kswapd(void *p) { unsigned long order, new_order; unsigned balanced_order; @@ -3383,6 +3383,8 @@ static int kswapd(void *p) tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; set_freezable(); + deferred_init_memmap(pgdat->node_id); + order = new_order = 0; balanced_order = 0; classzone_idx = new_classzone_idx = pgdat->nr_zones - 1; @@ -3538,7 +3540,7 @@ static int cpu_callback(struct notifier_block *nfb, unsigned long action, * This kswapd start function will be called by init and node-hot-add. * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added. */ -int kswapd_run(int nid) +int __defermem_init kswapd_run(int nid) { pg_data_t *pgdat = NODE_DATA(nid); int ret = 0; -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f174.google.com (mail-wi0-f174.google.com [209.85.212.174]) by kanga.kvack.org (Postfix) with ESMTP id 24A996B0075 for ; Thu, 23 Apr 2015 06:33:42 -0400 (EDT) Received: by wiax7 with SMTP id x7so9400754wia.0 for ; Thu, 23 Apr 2015 03:33:41 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id dq7si13561651wib.104.2015.04.23.03.33.26 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:26 -0700 (PDT) From: Mel Gorman Subject: [PATCH 09/13] mm: meminit: Minimise number of pfn->page lookups during initialisation Date: Thu, 23 Apr 2015 11:33:12 +0100 Message-Id: <1429785196-7668-10-git-send-email-mgorman@suse.de> In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Deferred struct page initialisation is using pfn_to_page() on every PFN unnecessarily. This patch minimises the number of lookups and scheduler checks.
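The lookup-saving idea can be shown in a small userspace sketch. This is an illustration under stated assumptions, not the kernel code: struct page_sketch, memmap[], pfn_to_page_sketch() and walk_range() are hypothetical names, the flat array stands in for the real memory map, and MAX_ORDER_NR_PAGES is assumed to be 1024. Within one MAX_ORDER-sized block the page pointer is simply incremented, and the full lookup is repeated only at the first PFN and at each block boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed block size for illustration (MAX_ORDER of 11 with order 0..10). */
#define MAX_ORDER_NR_PAGES 1024UL

/* Hypothetical stand-in for struct page. */
struct page_sketch {
	unsigned long flags;
};

/* A flat array stands in for the memory map in this sketch. */
static struct page_sketch memmap[4 * MAX_ORDER_NR_PAGES];

/* Counts how often the full pfn -> page translation is performed. */
static unsigned long nr_lookups;

static struct page_sketch *pfn_to_page_sketch(unsigned long pfn)
{
	nr_lookups++;
	return &memmap[pfn];
}

/*
 * Marks every page in [start_pfn, end_pfn) and returns how many pages
 * were visited. The translation is only redone when there is no current
 * page pointer or when the PFN crosses a MAX_ORDER_NR_PAGES boundary.
 */
static unsigned long walk_range(unsigned long start_pfn, unsigned long end_pfn)
{
	struct page_sketch *page = NULL;
	unsigned long pfn, visited = 0;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0)
			page++;                         /* cheap: next entry */
		else
			page = pfn_to_page_sketch(pfn); /* full lookup */

		page->flags = 1;
		visited++;
	}
	return visited;
}
```

In the patch the same boundary is also where cond_resched() is called, so the scheduler check happens once per block rather than once per page.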
Signed-off-by: Mel Gorman --- mm/page_alloc.c | 29 ++++++++++++++++++++++++----- 1 file changed, 24 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f2db3d7aa6cb..11125634e375 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1081,6 +1081,7 @@ void __defermem_init deferred_init_memmap(int nid) for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) { unsigned long pfn, end_pfn; + struct page *page = NULL; end_pfn = min(walk_end, zone_end_pfn(zone)); pfn = first_init_pfn; @@ -1090,13 +1091,32 @@ void __defermem_init deferred_init_memmap(int nid) pfn = zone->zone_start_pfn; for (; pfn < end_pfn; pfn++) { - struct page *page; - - if (!pfn_valid(pfn)) + if (!pfn_valid_within(pfn)) continue; - if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) + /* + * Ensure pfn_valid is checked every + * MAX_ORDER_NR_PAGES for memory holes + */ + if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) { + if (!pfn_valid(pfn)) { + page = NULL; + continue; + } + } + + if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) { + page = NULL; continue; + } + + /* Minimise pfn page lookups and scheduler checks */ + if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) { + page++; + } else { + page = pfn_to_page(pfn); + cond_resched(); + } if (page->flags) { VM_BUG_ON(page_zone(page) != zone); @@ -1105,7 +1125,6 @@ void __defermem_init deferred_init_memmap(int nid) __init_single_page(page, pfn, zid, nid); __free_pages_boot_core(page, pfn, 0); - cond_resched(); } first_init_pfn = max(end_pfn, first_init_pfn); } -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f41.google.com (mail-wg0-f41.google.com [74.125.82.41]) by kanga.kvack.org (Postfix) with ESMTP id 6D4296B0078 for ; Thu, 23 Apr 2015 06:33:44 -0400 (EDT) Received: by wgin8 with SMTP id n8so13644729wgi.0 for ; Thu, 23 Apr 2015 03:33:44 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id ab3si31322946wid.70.2015.04.23.03.33.26 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:27 -0700 (PDT) From: Mel Gorman Subject: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64 Date: Thu, 23 Apr 2015 11:33:13 +0100 Message-Id: <1429785196-7668-11-git-send-email-mgorman@suse.de> In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject says it all. Other architectures may enable on a case-by-case basis after auditing early_pfn_to_nid and testing. Signed-off-by: Mel Gorman --- arch/x86/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index b7d31ca55187..1beff8a8fbc9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -18,6 +18,7 @@ config X86_64 select X86_DEV_DMA_OPS select ARCH_USE_CMPXCHG_LOCKREF select HAVE_LIVEPATCH + select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT ### Arch settings config X86 -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f43.google.com (mail-wg0-f43.google.com [74.125.82.43]) by kanga.kvack.org (Postfix) with ESMTP id D6FC06B007B for ; Thu, 23 Apr 2015 06:33:46 -0400 (EDT) Received: by wgyo15 with SMTP id o15so13573018wgy.2 for ; Thu, 23 Apr 2015 03:33:46 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id x7si6767963wja.200.2015.04.23.03.33.27 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:28 -0700 (PDT) From: Mel Gorman Subject: [PATCH 11/13] mm: meminit: Free pages in large chunks where possible Date: Thu, 23 Apr 2015 11:33:14 +0100 Message-Id: <1429785196-7668-12-git-send-email-mgorman@suse.de> In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Parallel struct page initialisation frees pages one at a time. Try to free pages as single large pages where possible.
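The batching can be sketched in userspace as follows. This is a rough illustration, not the patch: free_block_sketch(), free_range_sketch() and free_pfns_batched() are hypothetical names, the counters only record what a real allocator would be asked to do, and MAX_ORDER is assumed to be 11. Runs of pages are accumulated and cut at MAX_ORDER block boundaries; a run that is exactly one aligned block is freed with a single high-order call, anything else falls back to order-0 frees. The flush after the loop is part of this sketch.

```c
#include <assert.h>

/* Assumed for illustration: the largest block is 2^(MAX_ORDER-1) pages. */
#define MAX_ORDER          11
#define MAX_ORDER_NR_PAGES (1UL << (MAX_ORDER - 1))

/* Counters standing in for the page allocator in this sketch. */
static unsigned long nr_free_calls;
static unsigned long nr_pages_freed;

/* Hypothetical stand-in for handing 2^order pages to the allocator. */
static void free_block_sketch(unsigned long pfn, unsigned int order)
{
	(void)pfn;
	nr_free_calls++;
	nr_pages_freed += 1UL << order;
}

/* Frees an accumulated run, as one large block when size and alignment allow. */
static void free_range_sketch(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long i;

	if (nr_pages == MAX_ORDER_NR_PAGES &&
	    (pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
		free_block_sketch(pfn, MAX_ORDER - 1);
		return;
	}

	for (i = 0; i < nr_pages; i++, pfn++)
		free_block_sketch(pfn, 0);
}

/* Frees [start_pfn, end_pfn), accumulating runs cut at block boundaries. */
static void free_pfns_batched(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long base = start_pfn, nr = 0, pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		/* Crossing into a new block: flush what has been gathered. */
		if (nr && (pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
			free_range_sketch(base, nr);
			nr = 0;
		}
		if (!nr)
			base = pfn;
		nr++;
	}

	/* Flush the final run (an addition made for this sketch). */
	if (nr)
		free_range_sketch(base, nr);
}
```

For a range covering PFNs 1000 to 3071 under these assumptions, the 24 unaligned leading pages are freed one by one and the two full blocks that follow need one call each, rather than 2072 separate order-0 frees.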
Signed-off-by: Mel Gorman --- mm/page_alloc.c | 46 +++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 41 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 11125634e375..73077dc63f0c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1056,6 +1056,20 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn, } #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +void __defermem_init deferred_free_range(struct page *page, unsigned long pfn, + int nr_pages) +{ + int i; + + if (nr_pages == MAX_ORDER_NR_PAGES && (pfn & (MAX_ORDER_NR_PAGES-1)) == 0) { + __free_pages_boot_core(page, pfn, MAX_ORDER-1); + return; + } + + for (i = 0; i < nr_pages; i++, page++, pfn++) + __free_pages_boot_core(page, pfn, 0); +} + /* Initialise remaining memory on a node */ void __defermem_init deferred_init_memmap(int nid) { @@ -1082,6 +1096,9 @@ void __defermem_init deferred_init_memmap(int nid) for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) { unsigned long pfn, end_pfn; struct page *page = NULL; + struct page *free_base_page = NULL; + unsigned long free_base_pfn = 0; + int nr_to_free = 0; end_pfn = min(walk_end, zone_end_pfn(zone)); pfn = first_init_pfn; @@ -1092,7 +1109,7 @@ void __defermem_init deferred_init_memmap(int nid) for (; pfn < end_pfn; pfn++) { if (!pfn_valid_within(pfn)) - continue; + goto free_range; /* * Ensure pfn_valid is checked every @@ -1101,30 +1118,49 @@ void __defermem_init deferred_init_memmap(int nid) if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) { if (!pfn_valid(pfn)) { page = NULL; - continue; + goto free_range; } } if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) { page = NULL; - continue; + goto free_range; } /* Minimise pfn page lookups and scheduler checks */ if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) { page++; } else { + deferred_free_range(free_base_page, + free_base_pfn, nr_to_free); + free_base_page = NULL; + free_base_pfn = nr_to_free = 0; + page = pfn_to_page(pfn); 
cond_resched(); } if (page->flags) { VM_BUG_ON(page_zone(page) != zone); - continue; + goto free_range; } __init_single_page(page, pfn, zid, nid); - __free_pages_boot_core(page, pfn, 0); + if (!free_base_page) { + free_base_page = page; + free_base_pfn = pfn; + nr_to_free = 0; + } + nr_to_free++; + + /* Where possible, batch up pages for a single free */ + continue; +free_range: + /* Free the current block of pages to allocator */ + if (free_base_page) + deferred_free_range(free_base_page, free_base_pfn, nr_to_free); + free_base_page = NULL; + free_base_pfn = nr_to_free = 0; } first_init_pfn = max(end_pfn, first_init_pfn); } -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com [209.85.212.179]) by kanga.kvack.org (Postfix) with ESMTP id 3D7FF6B007D for ; Thu, 23 Apr 2015 06:33:49 -0400 (EDT) Received: by widdi4 with SMTP id di4so210332618wid.0 for ; Thu, 23 Apr 2015 03:33:48 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de.
[195.135.220.15]) by mx.google.com with ESMTPS id w9si13609710wif.30.2015.04.23.03.33.28 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:29 -0700 (PDT) From: Mel Gorman Subject: [PATCH 12/13] mm: meminit: Reduce number of times pageblocks are set during struct page init Date: Thu, 23 Apr 2015 11:33:15 +0100 Message-Id: <1429785196-7668-13-git-send-email-mgorman@suse.de> In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman During parallel struct page initialisation, ranges are checked for every PFN unnecessarily which increases boot times. This patch alters when the ranges are checked. Signed-off-by: Mel Gorman --- mm/page_alloc.c | 45 +++++++++++++++++++++++---------------------- 1 file changed, 23 insertions(+), 22 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 73077dc63f0c..576b03bc9057 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -852,33 +852,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page) static void __meminit __init_single_page(struct page *page, unsigned long pfn, unsigned long zone, int nid) { - struct zone *z = &NODE_DATA(nid)->node_zones[zone]; - set_page_links(page, zone, nid, pfn); mminit_verify_page_links(page, zone, nid, pfn); init_page_count(page); page_mapcount_reset(page); page_cpupid_reset_last(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made.
Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - INIT_LIST_HEAD(&page->lru); #ifdef WANT_PAGE_VIRTUAL /* The shift won't overflow because ZONE_NORMAL is below 4G. */ @@ -1062,6 +1041,7 @@ void __defermem_init deferred_free_range(struct page *page, unsigned long pfn, int i; if (nr_pages == MAX_ORDER_NR_PAGES && (pfn & (MAX_ORDER_NR_PAGES-1)) == 0) { + set_pageblock_migratetype(page, MIGRATE_MOVABLE); __free_pages_boot_core(page, pfn, MAX_ORDER-1); return; } @@ -4471,7 +4451,28 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, &nr_initialised)) break; } - __init_single_pfn(pfn, zone, nid); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. + */ + if (!(pfn & (pageblock_nr_pages - 1))) { + struct page *page = pfn_to_page(pfn); + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + __init_single_page(page, pfn, zone, nid); + } else { + __init_single_pfn(pfn, zone, nid); + } } } -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f169.google.com (mail-wi0-f169.google.com [209.85.212.169]) by kanga.kvack.org (Postfix) with ESMTP id 716CA6B0080 for ; Thu, 23 Apr 2015 06:33:51 -0400 (EDT) Received: by widdi4 with SMTP id di4so210333567wid.0 for ; Thu, 23 Apr 2015 03:33:51 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id i7si13029569wjq.156.2015.04.23.03.33.29 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 03:33:30 -0700 (PDT) From: Mel Gorman Subject: [PATCH 13/13] mm: meminit: Remove mminit_verify_page_links Date: Thu, 23 Apr 2015 11:33:16 +0100 Message-Id: <1429785196-7668-14-git-send-email-mgorman@suse.de> In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman mminit_verify_page_links() is an extremely paranoid check that was introduced when memory initialisation was being heavily reworked. Profiles indicated that up to 10% of parallel memory initialisation was spent on checking this for every page. The cost could be reduced but in practice this check only found problems very early during the initialisation rewrite and has found nothing since. This patch removes an expensive unnecessary check. 
Signed-off-by: Mel Gorman --- mm/internal.h | 8 -------- mm/mm_init.c | 8 -------- mm/page_alloc.c | 1 - 3 files changed, 17 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 2c4057140bec..c73ad248f8f4 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -360,10 +360,7 @@ do { \ } while (0) extern void mminit_verify_pageflags_layout(void); -extern void mminit_verify_page_links(struct page *page, - enum zone_type zone, unsigned long nid, unsigned long pfn); extern void mminit_verify_zonelist(void); - #else static inline void mminit_dprintk(enum mminit_level level, @@ -375,11 +372,6 @@ static inline void mminit_verify_pageflags_layout(void) { } -static inline void mminit_verify_page_links(struct page *page, - enum zone_type zone, unsigned long nid, unsigned long pfn) -{ -} - static inline void mminit_verify_zonelist(void) { } diff --git a/mm/mm_init.c b/mm/mm_init.c index 28fbf87b20aa..fdadf918de76 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -131,14 +131,6 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, - unsigned long nid, unsigned long pfn) -{ - BUG_ON(page_to_nid(page) != nid); - BUG_ON(page_zonenum(page) != zone); - BUG_ON(page_to_pfn(page) != pfn); -} - static __init int set_mminit_loglevel(char *str) { get_option(&str, &mminit_loglevel); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 576b03bc9057..739b1840de2c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -853,7 +853,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn, unsigned long zone, int nid) { set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); init_page_count(page); page_mapcount_reset(page); page_cpupid_reset_last(page); -- 2.3.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f46.google.com (mail-la0-f46.google.com [209.85.215.46]) by kanga.kvack.org (Postfix) with ESMTP id 7BA986B0032 for ; Thu, 23 Apr 2015 11:54:17 -0400 (EDT) Received: by layy10 with SMTP id y10so15979187lay.0 for ; Thu, 23 Apr 2015 08:54:17 -0700 (PDT) Received: from numascale.com (numascale.com. [213.162.240.84]) by mx.google.com with ESMTPS id q3si6220861lah.142.2015.04.23.08.54.15 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 23 Apr 2015 08:54:15 -0700 (PDT) Date: Thu, 23 Apr 2015 23:53:57 +0800 From: Daniel J Blueman Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3 Message-Id: <1429804437.24139.3@cpanel21.proisp.no> In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Andrew Morton , LKML , 'Steffen Persvold' On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman wrote: > The big change here is an adjustment to the topology_init path that > caused > soft lockups on Waiman and Daniel Blue had reported it was an > expensive > function. > > Changelog since v2 > o Reduce overhead of topology_init > o Remove boot-time kernel parameter to enable/disable > o Enable on UMA > > Changelog since v1 > o Always initialise low zones > o Typo corrections > o Rename parallel mem init to parallel struct page init > o Rebase to 4.0 [] Splendid work! On this 256c setup, topology_init now takes 185ms. This brings the kernel boot time down to 324s [1]. It turns out that one memset is responsible for most of the time spent setting up the PUDs and PMDs; adapting memset to use non-temporal writes [3] avoids generating RMW cycles, bringing boot time down to 186s [2].
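As an illustration of the idea only, and not the kernel patch [3]: a minimal user-space C sketch of a fill that uses non-temporal stores for the aligned bulk of the buffer. The name fill_nontemporal, the SSE2 intrinsic path and the plain-store fallback on other architectures are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>          /* _mm_stream_si64(), _mm_sfence() */
#define HAVE_NT_STORE 1
#else
#define HAVE_NT_STORE 0         /* assumption: fall back to ordinary stores */
#endif

/*
 * Hypothetical helper, not a kernel API: fill n bytes at s with byte c.
 * The 8-byte aligned middle of the buffer is written with non-temporal
 * stores where available, so the CPU need not read the cache line first.
 */
static void *fill_nontemporal(void *s, int c, size_t n)
{
    unsigned char *p = s;
    /* Expand the byte to 64 bits, like the imulq in the patch. */
    uint64_t v = (uint64_t)(unsigned char)c * 0x0101010101010101ULL;

    /* Head: byte stores until p is 8-byte aligned. */
    while (n && ((uintptr_t)p & 7)) {
        *p++ = (unsigned char)c;
        n--;
    }

    /* Body: one 8-byte store per iteration. */
    while (n >= 8) {
#if HAVE_NT_STORE
        _mm_stream_si64((long long *)p, (long long)v);
#else
        memcpy(p, &v, 8);
#endif
        p += 8;
        n -= 8;
    }

#if HAVE_NT_STORE
    _mm_sfence();               /* streaming stores are weakly ordered */
#endif

    /* Tail: remaining 0..7 bytes. */
    while (n--)
        *p++ = (unsigned char)c;

    return s;
}
```

Non-temporal stores are weakly ordered, hence the sfence before returning. Whether they help at all depends on the platform and on the buffer not being read again soon afterwards.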
If this is a possibility, I can split this patch and map other arch's memset_nocache to memset, or change the callsite as preferred; comments welcome. Thanks, Daniel [1] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2.txt [2] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2-nontemporal.txt -- [3] From f822139736cab8434302693c635fa146b465273c Mon Sep 17 00:00:00 2001 From: Daniel J Blueman Date: Thu, 23 Apr 2015 23:26:27 +0800 Subject: [RFC] Speedup PMD setup Using non-temporal writes prevents read-modify-write cycles, which are much slower over large topologies. Adapt the existing memset() function into a _nocache variant and use when setting up PMDs during early boot to reduce boot time. Signed-off-by: Daniel J Blueman --- arch/x86/include/asm/string_64.h | 3 ++ arch/x86/lib/memset_64.S | 90 ++++++++++++++++++++++++++++++++++++++++ mm/memblock.c | 2 +- 3 files changed, 94 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h index e466119..1ef28d0 100644 --- a/arch/x86/include/asm/string_64.h +++ b/arch/x86/include/asm/string_64.h @@ -55,6 +55,8 @@ extern void *memcpy(void *to, const void *from, size_t len); #define __HAVE_ARCH_MEMSET void *memset(void *s, int c, size_t n); void *__memset(void *s, int c, size_t n); +void *memset_nocache(void *s, int c, size_t n); +void *__memset_nocache(void *s, int c, size_t n); #define __HAVE_ARCH_MEMMOVE void *memmove(void *dest, const void *src, size_t count); @@ -77,6 +79,7 @@ int strcmp(const char *cs, const char *ct); #define memcpy(dst, src, len) __memcpy(dst, src, len) #define memmove(dst, src, len) __memmove(dst, src, len) #define memset(s, c, n) __memset(s, c, n) +#define memset_nocache(s, c, n) __memset_nocache(s, c, n) #endif #endif /* __KERNEL__ */ diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S index 6f44935..fb46f78 100644 --- a/arch/x86/lib/memset_64.S +++ b/arch/x86/lib/memset_64.S @@ -137,6 +137,96 @@ 
ENTRY(__memset) ENDPROC(memset) ENDPROC(__memset) +/* + * bzero_nocache - set a memory block to zero. This function uses + * non-temporal writes in the fastpath + * + * rdi destination + * rsi value (char) + * rdx count (bytes) + * + * rax original destination + */ + +ENTRY(memset_nocache) +ENTRY(__memset_nocache) + CFI_STARTPROC + movq %rdi,%r10 + + /* expand byte value */ + movzbl %sil,%ecx + movabs $0x0101010101010101,%rax + imulq %rcx,%rax + + /* align dst */ + movl %edi,%r9d + andl $7,%r9d + jnz bad_alignment + CFI_REMEMBER_STATE +after_bad_alignment: + + movq %rdx,%rcx + shrq $6,%rcx + jz handle_tail + + .p2align 4 +loop_64: + decq %rcx + movnti %rax,(%rdi) + movnti %rax,8(%rdi) + movnti %rax,16(%rdi) + movnti %rax,24(%rdi) + movnti %rax,32(%rdi) + movnti %rax,40(%rdi) + movnti %rax,48(%rdi) + movnti %rax,56(%rdi) + leaq 64(%rdi),%rdi + jnz loop_64 + + /* Handle tail in loops; the loops should be faster than hard + to predict jump tables */ + .p2align 4 +handle_tail: + movl %edx,%ecx + andl $63&(~7),%ecx + jz handle_7 + shrl $3,%ecx + .p2align 4 +loop_8: + decl %ecx + movnti %rax,(%rdi) + leaq 8(%rdi),%rdi + jnz loop_8 + +handle_7: + andl $7,%edx + jz ende + .p2align 4 +loop_1: + decl %edx + movb %al,(%rdi) + leaq 1(%rdi),%rdi + jnz loop_1 + +ende: + movq %r10,%rax + ret + + CFI_RESTORE_STATE +bad_alignment: + cmpq $7,%rdx + jbe handle_7 + movnti %rax,(%rdi) /* unaligned store */ + movq $8,%r8 + subq %r9,%r8 + addq %r8,%rdi + subq %r8,%rdx + jmp after_bad_alignment +final: + CFI_ENDPROC +ENDPROC(memset_nocache) +ENDPROC(__memset_nocache) + /* Some CPUs support enhanced REP MOVSB/STOSB feature. * It is recommended to use this when possible. 
* diff --git a/mm/memblock.c b/mm/memblock.c index f3e97d8..df434d2 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1212,7 +1212,7 @@ again: done: memblock_reserve(alloc, size); ptr = phys_to_virt(alloc); - memset(ptr, 0, size); + memset_nocache(ptr, 0, size); /* * The min_count is set to 0 so that bootmem allocated blocks -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f174.google.com (mail-wi0-f174.google.com [209.85.212.174]) by kanga.kvack.org (Postfix) with ESMTP id A427D6B0032 for ; Thu, 23 Apr 2015 11:56:13 -0400 (EDT) Received: by wiun10 with SMTP id n10so97577260wiu.1 for ; Thu, 23 Apr 2015 08:56:13 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id ge8si32667964wib.104.2015.04.23.08.56.11 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 08:56:12 -0700 (PDT) Date: Thu, 23 Apr 2015 16:56:07 +0100 From: Mel Gorman Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set Message-ID: <20150423155607.GA2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-8-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1429785196-7668-8-git-send-email-mgorman@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML On Thu, Apr 23, 2015 at 11:33:10AM +0100, Mel Gorman wrote: > This patch initalises all low memory struct pages and 2G of the highest zone > on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT > is set. 
That config option cannot be set but will be available in a later > patch. Parallel initialisation of struct page depends on some features > from memory hotplug and it is necessary to alter alter section annotations. > > Signed-off-by: Mel Gorman I belatedly noticed that this causes section warnings. It'll be harmless for testing but the next (hopefully last) version will have this on top diff --git a/drivers/base/node.c b/drivers/base/node.c index d03e976b4431..97ab2c4dd39e 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -361,14 +361,14 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid) #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE #define page_initialized(page) (page->lru.next) -static int get_nid_for_pfn(struct pglist_data *pgdat, unsigned long pfn) +static int __init_refok get_nid_for_pfn(unsigned long pfn) { struct page *page; if (!pfn_valid_within(pfn)) return -1; #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT - if (pgdat && pfn >= pgdat->first_deferred_pfn) + if (system_state == SYSTEM_BOOTING) return early_pfn_to_nid(pfn); #endif page = pfn_to_page(pfn); @@ -382,7 +382,6 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) { int ret; unsigned long pfn, sect_start_pfn, sect_end_pfn; - struct pglist_data *pgdat = NODE_DATA(nid); if (!mem_blk) return -EFAULT; @@ -395,7 +394,7 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) { int page_nid; - page_nid = get_nid_for_pfn(pgdat, pfn); + page_nid = get_nid_for_pfn(pfn); if (page_nid < 0) continue; if (page_nid != nid) @@ -434,7 +433,7 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) { int nid; - nid = get_nid_for_pfn(NULL, pfn); + nid = get_nid_for_pfn(pfn); if (nid < 0) continue; if (!node_online(nid)) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f177.google.com (mail-wi0-f177.google.com [209.85.212.177]) by kanga.kvack.org (Postfix) with ESMTP id 7D2F76B0032 for ; Thu, 23 Apr 2015 12:30:46 -0400 (EDT) Received: by wicmx19 with SMTP id mx19so16428917wic.1 for ; Thu, 23 Apr 2015 09:30:46 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id d10si15077901wix.109.2015.04.23.09.30.44 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 23 Apr 2015 09:30:44 -0700 (PDT) Date: Thu, 23 Apr 2015 17:30:39 +0100 From: Mel Gorman Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3 Message-ID: <20150423163039.GB2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429804437.24139.3@cpanel21.proisp.no> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1429804437.24139.3@cpanel21.proisp.no> Sender: owner-linux-mm@kvack.org List-ID: To: Daniel J Blueman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Andrew Morton , LKML , 'Steffen Persvold' On Thu, Apr 23, 2015 at 11:53:57PM +0800, Daniel J Blueman wrote: > On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman wrote: > >The big change here is an adjustment to the topology_init path > >that caused > >soft lockups on Waiman and Daniel Blue had reported it was an > >expensive > >function. > > > >Changelog since v2 > >o Reduce overhead of topology_init > >o Remove boot-time kernel parameter to enable/disable > >o Enable on UMA > > > >Changelog since v1 > >o Always initialise low zones > >o Typo corrections > >o Rename parallel mem init to parallel struct page init > >o Rebase to 4.0 > [] > > Splendid work! On this 256c setup, topology_init now takes 185ms. > > This brings the kernel boot time down to 324s [1]. Good stuff. 
Am I correct in thinking that the vanilla kernel takes 732s? > It turns out that > one memset is responsible for most of the time setting up the the > PUDs and PMDs; adapting memset to using non-temporal writes [3] > avoids generating RMW cycles, bringing boot time down to 186s [2]. > > If this is a possibility, I can split this patch and map other > arch's memset_nocache to memset, or change the callsite as > preferred; comments welcome. > In general, I see no problem with the patch and it would be useful going in before or after this series. I would suggest you split this into three patches. The first would be an asm-generic alias of memset_nocache to memset with documentation saying it's optional for an architecture to implement. The second would be your implementation for x86 that needs to go to the x86 maintainers. The third would then be the memblock.c change. Thanks. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f174.google.com (mail-ob0-f174.google.com [209.85.214.174]) by kanga.kvack.org (Postfix) with ESMTP id 7D8BC6B0032 for ; Fri, 24 Apr 2015 15:48:16 -0400 (EDT) Received: by obfe9 with SMTP id e9so45592463obf.1 for ; Fri, 24 Apr 2015 12:48:16 -0700 (PDT) Received: from g4t3426.houston.hp.com (g4t3426.houston.hp.com.
[15.201.208.54]) by mx.google.com with ESMTPS id s204si8956448oia.32.2015.04.24.12.48.15 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 24 Apr 2015 12:48:15 -0700 (PDT) Message-ID: <553A9DFC.5040803@hp.com> Date: Fri, 24 Apr 2015 15:48:12 -0400 From: Waiman Long MIME-Version: 1.0 Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3 References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429804437.24139.3@cpanel21.proisp.no> In-Reply-To: <1429804437.24139.3@cpanel21.proisp.no> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Daniel J Blueman Cc: Mel Gorman , Linux-MM , Nathan Zimmer , Dave Hansen , Scott Norton , Andrew Morton , LKML , 'Steffen Persvold' On 04/23/2015 11:53 AM, Daniel J Blueman wrote: > On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman wrote: >> The big change here is an adjustment to the topology_init path that >> caused >> soft lockups on Waiman and Daniel Blue had reported it was an expensive >> function. >> >> Changelog since v2 >> o Reduce overhead of topology_init >> o Remove boot-time kernel parameter to enable/disable >> o Enable on UMA >> >> Changelog since v1 >> o Always initialise low zones >> o Typo corrections >> o Rename parallel mem init to parallel struct page init >> o Rebase to 4.0 > [] > > Splendid work! On this 256c setup, topology_init now takes 185ms. > > This brings the kernel boot time down to 324s [1]. It turns out that > one memset is responsible for most of the time setting up the the PUDs > and PMDs; adapting memset to using non-temporal writes [3] avoids > generating RMW cycles, bringing boot time down to 186s [2]. > > If this is a possibility, I can split this patch and map other arch's > memset_nocache to memset, or change the callsite as preferred; > comments welcome. 
> > Thanks, > Daniel > > [1] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2.txt > [2] > https://resources.numascale.com/telemetry/defermem/h8qgl-defer2-nontemporal.txt > > -- [3] > > From f822139736cab8434302693c635fa146b465273c Mon Sep 17 00:00:00 2001 > From: Daniel J Blueman > Date: Thu, 23 Apr 2015 23:26:27 +0800 > Subject: [RFC] Speedup PMD setup > > Using non-temporal writes prevents read-modify-write cycles, > which are much slower over large topologies. > > Adapt the existing memset() function into a _nocache variant and use > when setting up PMDs during early boot to reduce boot time. > > Signed-off-by: Daniel J Blueman > --- > arch/x86/include/asm/string_64.h | 3 ++ > arch/x86/lib/memset_64.S | 90 > ++++++++++++++++++++++++++++++++++++++++ > mm/memblock.c | 2 +- > 3 files changed, 94 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/string_64.h > b/arch/x86/include/asm/string_64.h > index e466119..1ef28d0 100644 > --- a/arch/x86/include/asm/string_64.h > +++ b/arch/x86/include/asm/string_64.h > @@ -55,6 +55,8 @@ extern void *memcpy(void *to, const void *from, > size_t len); > #define __HAVE_ARCH_MEMSET > void *memset(void *s, int c, size_t n); > void *__memset(void *s, int c, size_t n); > +void *memset_nocache(void *s, int c, size_t n); > +void *__memset_nocache(void *s, int c, size_t n); > > #define __HAVE_ARCH_MEMMOVE > void *memmove(void *dest, const void *src, size_t count); > @@ -77,6 +79,7 @@ int strcmp(const char *cs, const char *ct); > #define memcpy(dst, src, len) __memcpy(dst, src, len) > #define memmove(dst, src, len) __memmove(dst, src, len) > #define memset(s, c, n) __memset(s, c, n) > +#define memset_nocache(s, c, n) __memset_nocache(s, c, n) > #endif > > #endif /* __KERNEL__ */ > diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S > index 6f44935..fb46f78 100644 > --- a/arch/x86/lib/memset_64.S > +++ b/arch/x86/lib/memset_64.S > @@ -137,6 +137,96 @@ ENTRY(__memset) > ENDPROC(memset) > 
ENDPROC(__memset) > > +/* > + * bzero_nocache - set a memory block to zero. This function uses > + * non-temporal writes in the fastpath > + * > + * rdi destination > + * rsi value (char) > + * rdx count (bytes) > + * > + * rax original destination > + */ > + > +ENTRY(memset_nocache) > +ENTRY(__memset_nocache) > + CFI_STARTPROC > + movq %rdi,%r10 > + > + /* expand byte value */ > + movzbl %sil,%ecx > + movabs $0x0101010101010101,%rax > + imulq %rcx,%rax > + > + /* align dst */ > + movl %edi,%r9d > + andl $7,%r9d > + jnz bad_alignment > + CFI_REMEMBER_STATE > +after_bad_alignment: > + > + movq %rdx,%rcx > + shrq $6,%rcx > + jz handle_tail > + > + .p2align 4 > +loop_64: > + decq %rcx > + movnti %rax,(%rdi) > + movnti %rax,8(%rdi) > + movnti %rax,16(%rdi) > + movnti %rax,24(%rdi) > + movnti %rax,32(%rdi) > + movnti %rax,40(%rdi) > + movnti %rax,48(%rdi) > + movnti %rax,56(%rdi) > + leaq 64(%rdi),%rdi > + jnz loop_64 > + > + Your version of memset_nocache differs from memset only in its use of the movnti instruction. You may consider using preprocessor macros so that a single copy of the source code generates two different versions of the executable code. That will make the new code much easier to maintain. For example, #include ... #define MOVQ movnti #define memset memset_nocache #define __memset __memset_nocache #include "memset_64.S" Of course, you need to replace the target movq instructions in memset_64.S with MOVQ and define #ifndef MOVQ #define MOVQ movq #endif You also need to use a conditional compilation macro to disable the alternate instruction stuff in memset_64.S. Cheers, Longman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f174.google.com (mail-pd0-f174.google.com [209.85.192.174]) by kanga.kvack.org (Postfix) with ESMTP id 2F3E76B0038 for ; Mon, 27 Apr 2015 18:43:29 -0400 (EDT) Received: by pdbnk13 with SMTP id nk13so143197114pdb.0 for ; Mon, 27 Apr 2015 15:43:28 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id c7si31780250pdn.193.2015.04.27.15.43.28 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 15:43:28 -0700 (PDT) Date: Mon, 27 Apr 2015 15:43:27 -0700 From: Andrew Morton Subject: Re: [PATCH 03/13] mm: meminit: Only set page reserved in the memblock region Message-Id: <20150427154327.f7326dc16649ae402b5b5dd3@linux-foundation.org> In-Reply-To: <1429785196-7668-4-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-4-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Thu, 23 Apr 2015 11:33:06 +0100 Mel Gorman wrote: > From: Nathan Zimmer > > Currently we when we initialze each page struct is set as reserved upon > initialization. Hard to parse. I changed it to "Currently each page struct is set as reserved upon initialization". > This changes to starting with the reserved bit clear and > then only setting the bit in the reserved region. For what reason? A code comment over reserve_bootmem_region() would be a good way to answer that. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f170.google.com (mail-pd0-f170.google.com [209.85.192.170]) by kanga.kvack.org (Postfix) with ESMTP id C05C16B006C for ; Mon, 27 Apr 2015 18:43:35 -0400 (EDT) Received: by pdbqa5 with SMTP id qa5so143178405pdb.1 for ; Mon, 27 Apr 2015 15:43:35 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id z14si31829863pdi.58.2015.04.27.15.43.34 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 15:43:34 -0700 (PDT) Date: Mon, 27 Apr 2015 15:43:33 -0700 From: Andrew Morton Subject: Re: [PATCH 05/13] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid Message-Id: <20150427154333.85a1fd2dbc38c7c0888fd4f5@linux-foundation.org> In-Reply-To: <1429785196-7668-6-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-6-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Thu, 23 Apr 2015 11:33:08 +0100 Mel Gorman wrote: > __early_pfn_to_nid() in the generic and arch-specific implementations > use static variables to cache recent lookups. Without the cache > boot times are much higher due to the excessive memblock lookups but > it assumes that memory initialisation is single-threaded. Parallel > initialisation of struct pages will break that assumption so this patch > makes __early_pfn_to_nid() SMP-safe by requiring the caller to cache > recent search information. early_pfn_to_nid() keeps the same interface > but is only safe to use early in boot due to the use of a global static > variable. 
meminit_pfn_in_nid() is an SMP-safe version that callers must > maintain their own state for. Seems a bit awkward. > +struct __meminitdata mminit_pfnnid_cache global_init_state; > + > +/* Only safe to use early in boot when initialisation is single-threaded */ > int __meminit early_pfn_to_nid(unsigned long pfn) > { > int nid; > > - nid = __early_pfn_to_nid(pfn); > + /* The system will behave unpredictably otherwise */ > + BUG_ON(system_state != SYSTEM_BOOTING); Because of this. Providing a cache per cpu: struct __meminitdata mminit_pfnnid_cache global_init_state[NR_CPUS]; would be simpler? Also, `global_init_state' is a poor name for a kernel-wide symbol. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f176.google.com (mail-pd0-f176.google.com [209.85.192.176]) by kanga.kvack.org (Postfix) with ESMTP id 22AF66B006E for ; Mon, 27 Apr 2015 18:43:46 -0400 (EDT) Received: by pdbqa5 with SMTP id qa5so143182059pdb.1 for ; Mon, 27 Apr 2015 15:43:45 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. 
[140.211.169.12]) by mx.google.com with ESMTPS id kw15si31766287pab.203.2015.04.27.15.43.45 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 15:43:45 -0700 (PDT) Date: Mon, 27 Apr 2015 15:43:44 -0700 From: Andrew Morton Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set Message-Id: <20150427154344.421fd9f151bf27d365d02fd2@linux-foundation.org> In-Reply-To: <1429785196-7668-8-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-8-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Thu, 23 Apr 2015 11:33:10 +0100 Mel Gorman wrote: > This patch initalises all low memory struct pages and 2G of the highest zone > on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT > is set. That config option cannot be set but will be available in a later > patch. Parallel initialisation of struct page depends on some features > from memory hotplug and it is necessary to alter alter section annotations. > > ... > > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT > +#define __defermem_init __meminit > +#define __defer_init __meminit > +#else > +#define __defermem_init > +#define __defer_init __init > +#endif Could we get some comments describing these? What they do, when and where they should be used. I have a suspicion that the naming isn't good, but I didn't spend a lot of time reverse-engineering the intent... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f53.google.com (mail-pa0-f53.google.com [209.85.220.53]) by kanga.kvack.org (Postfix) with ESMTP id AC3526B0070 for ; Mon, 27 Apr 2015 18:43:52 -0400 (EDT) Received: by pacyx8 with SMTP id yx8so143915807pac.1 for ; Mon, 27 Apr 2015 15:43:52 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id m4si31783770pdp.192.2015.04.27.15.43.51 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 15:43:52 -0700 (PDT) Date: Mon, 27 Apr 2015 15:43:50 -0700 From: Andrew Morton Subject: Re: [PATCH 08/13] mm: meminit: Initialise remaining struct pages in parallel with kswapd Message-Id: <20150427154350.4d649694a56e5bbc519e1fb4@linux-foundation.org> In-Reply-To: <1429785196-7668-9-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-9-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Thu, 23 Apr 2015 11:33:11 +0100 Mel Gorman wrote: > Only a subset of struct pages are initialised at the moment. When this patch > is applied kswapd initialise the remaining struct pages in parallel. This > should boot faster by spreading the work to multiple CPUs and initialising > data that is local to the CPU. The user-visible effect on large machines > is that free memory will appear to rapidly increase early in the lifetime > of the system until kswapd reports that all memory is initialised in the > kernel log. Once initialised there should be no other user-visibile effects. > > ... 
> > + pr_info("kswapd %d initialised deferred memory in %ums\n", nid, > + jiffies_to_msecs(jiffies - start)); It might be nice to tell people how much deferred memory kswapd initialised. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f181.google.com (mail-pd0-f181.google.com [209.85.192.181]) by kanga.kvack.org (Postfix) with ESMTP id 065136B0071 for ; Mon, 27 Apr 2015 18:43:58 -0400 (EDT) Received: by pdbqd1 with SMTP id qd1so143307479pdb.2 for ; Mon, 27 Apr 2015 15:43:57 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id pl10si31686040pbb.188.2015.04.27.15.43.57 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 15:43:57 -0700 (PDT) Date: Mon, 27 Apr 2015 15:43:56 -0700 From: Andrew Morton Subject: Re: [PATCH 11/13] mm: meminit: Free pages in large chunks where possible Message-Id: <20150427154356.67e3d186b732a2c2b00e49cb@linux-foundation.org> In-Reply-To: <1429785196-7668-12-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-12-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Thu, 23 Apr 2015 11:33:14 +0100 Mel Gorman wrote: > Parallel struct page frees pages one at a time. Try free pages as single > large pages where possible. > > ... > > void __defermem_init deferred_init_memmap(int nid) This function is gruesome in an 80-col display. Even the code comments wrap, which is nuts. 
Maybe hoist the contents of the outermost loop into a separate function, called for each zone? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f174.google.com (mail-pd0-f174.google.com [209.85.192.174]) by kanga.kvack.org (Postfix) with ESMTP id 402336B0038 for ; Mon, 27 Apr 2015 18:46:35 -0400 (EDT) Received: by pdbqd1 with SMTP id qd1so143362024pdb.2 for ; Mon, 27 Apr 2015 15:46:35 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id n1si31776623pdf.241.2015.04.27.15.46.34 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 27 Apr 2015 15:46:34 -0700 (PDT) Date: Mon, 27 Apr 2015 15:46:33 -0700 From: Andrew Morton Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function. Message-Id: <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> In-Reply-To: <1429785196-7668-3-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-3-git-send-email-mgorman@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman wrote: > From: Robin Holt : : host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1 : : Recipient address rejected: User unknown in virtual alias : table (in reply to RCPT TO command) Has Robin moved, or is SGI mail busted? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f178.google.com (mail-wi0-f178.google.com [209.85.212.178]) by kanga.kvack.org (Postfix) with ESMTP id 089696B0038 for ; Tue, 28 Apr 2015 04:28:41 -0400 (EDT) Received: by widdi4 with SMTP id di4so130100199wid.0 for ; Tue, 28 Apr 2015 01:28:40 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id om1si8566293wjc.104.2015.04.28.01.28.38 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 28 Apr 2015 01:28:39 -0700 (PDT) Date: Tue, 28 Apr 2015 09:28:31 +0100 From: Mel Gorman Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function. Message-ID: <20150428082831.GI2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-3-git-send-email-mgorman@suse.de> <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote: > On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman wrote: > > > From: Robin Holt > > : : host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1 > : : Recipient address rejected: User unknown in virtual alias > : table (in reply to RCPT TO command) > > Has Robin moved, or is SGI mail busted? Robin has moved and I do not have an updated address for him. The address used in the patches was the one he posted the patches with. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f41.google.com (mail-wg0-f41.google.com [74.125.82.41]) by kanga.kvack.org (Postfix) with ESMTP id 296CB6B0038 for ; Tue, 28 Apr 2015 05:37:57 -0400 (EDT) Received: by wgen6 with SMTP id n6so144427480wge.3 for ; Tue, 28 Apr 2015 02:37:56 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id cr5si20439102wjb.214.2015.04.28.02.37.54 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 28 Apr 2015 02:37:55 -0700 (PDT) Date: Tue, 28 Apr 2015 10:37:51 +0100 From: Mel Gorman Subject: Re: [PATCH 05/13] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid Message-ID: <20150428093751.GJ2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-6-git-send-email-mgorman@suse.de> <20150427154333.85a1fd2dbc38c7c0888fd4f5@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150427154333.85a1fd2dbc38c7c0888fd4f5@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Mon, Apr 27, 2015 at 03:43:33PM -0700, Andrew Morton wrote: > On Thu, 23 Apr 2015 11:33:08 +0100 Mel Gorman wrote: > > > __early_pfn_to_nid() in the generic and arch-specific implementations > > use static variables to cache recent lookups. Without the cache > > boot times are much higher due to the excessive memblock lookups but > > it assumes that memory initialisation is single-threaded. Parallel > > initialisation of struct pages will break that assumption so this patch > > makes __early_pfn_to_nid() SMP-safe by requiring the caller to cache > > recent search information. 
early_pfn_to_nid() keeps the same interface > > but is only safe to use early in boot due to the use of a global static > > variable. meminit_pfn_in_nid() is an SMP-safe version that callers must > > maintain their own state for. > > Seems a bit awkward. > I'm afraid I don't understand which part you mean. > > +struct __meminitdata mminit_pfnnid_cache global_init_state; > > + > > +/* Only safe to use early in boot when initialisation is single-threaded */ > > int __meminit early_pfn_to_nid(unsigned long pfn) > > { > > int nid; > > > > - nid = __early_pfn_to_nid(pfn); > > + /* The system will behave unpredictably otherwise */ > > + BUG_ON(system_state != SYSTEM_BOOTING); > > Because of this. > > Providing a cache per cpu: > > struct __meminitdata mminit_pfnnid_cache global_init_state[NR_CPUS]; > > would be simpler? > It would be simpler in terms of implementation but it's wasteful. We only need a small number of these caches early in boot. NR_CPUS is potentially very large. > > Also, `global_init_state' is a poor name for a kernel-wide symbol. You're right. It's not really global, it's just the one that is used if the caller does not track their own state. It should have been static and I renamed it to early_pfnnid_cache. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f48.google.com (mail-wg0-f48.google.com [74.125.82.48]) by kanga.kvack.org (Postfix) with ESMTP id E8DE96B0038 for ; Tue, 28 Apr 2015 05:53:29 -0400 (EDT) Received: by wgen6 with SMTP id n6so144876149wge.3 for ; Tue, 28 Apr 2015 02:53:29 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id ht6si17114462wib.102.2015.04.28.02.53.27 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 28 Apr 2015 02:53:28 -0700 (PDT) Date: Tue, 28 Apr 2015 10:53:23 +0100 From: Mel Gorman Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set Message-ID: <20150428095323.GK2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-8-git-send-email-mgorman@suse.de> <20150427154344.421fd9f151bf27d365d02fd2@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150427154344.421fd9f151bf27d365d02fd2@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Mon, Apr 27, 2015 at 03:43:44PM -0700, Andrew Morton wrote: > On Thu, 23 Apr 2015 11:33:10 +0100 Mel Gorman wrote: > > > This patch initialises all low memory struct pages and 2G of the highest zone > > on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT > > is set. That config option cannot be set but will be available in a later > > patch. Parallel initialisation of struct page depends on some features > > from memory hotplug and it is necessary to alter section annotations. > > > > ... > > > > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT > > +#define __defermem_init __meminit > > +#define __defer_init __meminit > > +#else > > +#define __defermem_init > > +#define __defer_init __init > > +#endif > > Could we get some comments describing these? What they do, when and > where they should be used. I have a suspicion that the naming isn't > good, but I didn't spend a lot of time reverse-engineering the > intent... > Of course. 
The next version will have +/* + * Deferred struct page initialisation requires some early init functions that + * are removed before kswapd is up and running. The feature depends on memory + * hotplug so put the data and code required by deferred initialisation into + * the __meminit section where they are preserved. + */ -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f169.google.com (mail-wi0-f169.google.com [209.85.212.169]) by kanga.kvack.org (Postfix) with ESMTP id 5D1B36B0038 for ; Tue, 28 Apr 2015 07:38:25 -0400 (EDT) Received: by wizk4 with SMTP id k4so136569911wiz.1 for ; Tue, 28 Apr 2015 04:38:25 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id b20si4079352wjx.55.2015.04.28.04.38.23 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 28 Apr 2015 04:38:23 -0700 (PDT) Date: Tue, 28 Apr 2015 12:38:20 +0100 From: Mel Gorman Subject: Re: [PATCH 11/13] mm: meminit: Free pages in large chunks where possible Message-ID: <20150428113819.GL2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-12-git-send-email-mgorman@suse.de> <20150427154356.67e3d186b732a2c2b00e49cb@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150427154356.67e3d186b732a2c2b00e49cb@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Mon, Apr 27, 2015 at 03:43:56PM -0700, Andrew Morton wrote: > On Thu, 23 Apr 2015 11:33:14 +0100 Mel Gorman wrote: > > > Parallel struct page frees pages one at a time. 
Try free pages as single > > large pages where possible. > > > > ... > > > > void __defermem_init deferred_init_memmap(int nid) > > This function is gruesome in an 80-col display. Even the code comments > wrap, which is nuts. Maybe hoist the contents of the outermost loop > into a separate function, called for each zone? I can do better than that because only the highest zone is deferred in this version and the loop is no longer necessary. I should post a V4 before the end of my day that addresses your feedback. It caused a lot of conflicts and it'll be easier to replace the full series than try managing incremental fixes. Thanks Andrew. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f178.google.com (mail-pd0-f178.google.com [209.85.192.178]) by kanga.kvack.org (Postfix) with ESMTP id A714F6B006C for ; Tue, 28 Apr 2015 09:41:37 -0400 (EDT) Received: by pdbqa5 with SMTP id qa5so164322157pdb.1 for ; Tue, 28 Apr 2015 06:41:37 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. 
[140.211.169.12]) by mx.google.com with ESMTPS id gy10si24455150pbd.243.2015.04.28.06.41.36 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 06:41:36 -0700 (PDT) Date: Tue, 28 Apr 2015 06:48:10 -0700 From: Andrew Morton Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set Message-Id: <20150428064810.0882ad36.akpm@linux-foundation.org> In-Reply-To: <20150428095323.GK2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-8-git-send-email-mgorman@suse.de> <20150427154344.421fd9f151bf27d365d02fd2@linux-foundation.org> <20150428095323.GK2449@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Tue, 28 Apr 2015 10:53:23 +0100 Mel Gorman wrote: > > > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT > > > +#define __defermem_init __meminit > > > +#define __defer_init __meminit > > > +#else > > > +#define __defermem_init > > > +#define __defer_init __init > > > +#endif > > > > Could we get some comments describing these? What they do, when and > > where they should be used. I have a suspicion that the naming isn't > > good, but I didn't spend a lot of time reverse-engineering the > > intent... > > > > Of course. The next version will have > > +/* > + * Deferred struct page initialisation requires some early init functions that > + * are removed before kswapd is up and running. The feature depends on memory > + * hotplug so put the data and code required by deferred initialisation into > + * the __meminit section where they are preserved. 
> + */ I'm still not getting it even a little bit :( You say "data and code", so I'd expect to see #define __defer_meminitdata __meminitdata #define __defer_meminit __meminit But the patch doesn't mention the data segment at all. The patch uses both __defermem_init and __defer_init to tag functions (ie: text) and I can't work out why. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f48.google.com (mail-wg0-f48.google.com [74.125.82.48]) by kanga.kvack.org (Postfix) with ESMTP id 15FAA6B006E for ; Tue, 28 Apr 2015 10:56:37 -0400 (EDT) Received: by wgen6 with SMTP id n6so154363038wge.3 for ; Tue, 28 Apr 2015 07:56:36 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id ez17si38857399wjc.157.2015.04.28.07.56.35 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 28 Apr 2015 07:56:35 -0700 (PDT) Date: Tue, 28 Apr 2015 15:56:32 +0100 From: Mel Gorman Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set Message-ID: <20150428145632.GN2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-8-git-send-email-mgorman@suse.de> <20150427154344.421fd9f151bf27d365d02fd2@linux-foundation.org> <20150428095323.GK2449@suse.de> <20150428064810.0882ad36.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150428064810.0882ad36.akpm@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Tue, Apr 28, 2015 at 06:48:10AM -0700, Andrew Morton wrote: > On Tue, 28 Apr 2015 10:53:23 +0100 Mel Gorman wrote: > > > 
> > > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT > > > > +#define __defermem_init __meminit > > > > +#define __defer_init __meminit > > > > +#else > > > > +#define __defermem_init > > > > +#define __defer_init __init > > > > +#endif > > > > > > Could we get some comments describing these? What they do, when and > > > where they should be used. I have a suspicion that the naming isn't > > > good, but I didn't spend a lot of time reverse-engineering the > > > intent... > > > > > > > Of course. The next version will have > > > > +/* > > + * Deferred struct page initialisation requires some early init functions that > > + * are removed before kswapd is up and running. The feature depends on memory > > + * hotplug so put the data and code required by deferred initialisation into > > + * the __meminit section where they are preserved. > > + */ > > I'm still not getting it even a little bit :( You say "data and code", > so I'd expect to see > > #define __defer_meminitdata __meminitdata > #define __defer_meminit __meminit > > But the patch doesn't mention the data segment at all. > Take 2. Suggestions on different names are welcome because the current ones are poor. /* * Deferred struct page initialisation requires init functions that are freed * before kswapd is available. Reuse the memory hotplug section annotation * to mark the required code. * * __defermem_init is code that always exists but is annotated __meminit to * avoid section warnings. * __defer_init code gets marked __meminit when deferring struct page * initialisation but is otherwise in the init section. */ -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
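The effect of the two annotations discussed in this message can be modelled in plain C. What follows is a minimal sketch under stated assumptions, not the kernel's implementation: the DEMO_* names and demo_survives_boot() are hypothetical, sections are represented by an enum rather than real linker sections, and the rule that __meminit text survives boot only when memory hotplug is enabled is taken from the remark in the thread that the feature depends on memory hotplug.

```c
#include <stdbool.h>

/* Toy model: the section a function lives in decides whether it survives boot. */
enum demo_section {
	DEMO_SEC_TEXT,    /* ordinary text, never freed */
	DEMO_SEC_INIT,    /* models __init: freed once boot completes */
	DEMO_SEC_MEMINIT, /* models __meminit: kept if memory hotplug is enabled */
};

/* Hypothetical stand-in for CONFIG_DEFERRED_STRUCT_PAGE_INIT; set to 0 to compare. */
#define DEMO_DEFERRED_STRUCT_PAGE_INIT 1

#if DEMO_DEFERRED_STRUCT_PAGE_INIT
#define DEMO_DEFERMEM_INIT DEMO_SEC_MEMINIT /* models __defermem_init */
#define DEMO_DEFER_INIT    DEMO_SEC_MEMINIT /* models __defer_init */
#else
#define DEMO_DEFERMEM_INIT DEMO_SEC_TEXT    /* no annotation: plain text */
#define DEMO_DEFER_INIT    DEMO_SEC_INIT    /* falls back to the init section */
#endif

/* Return true if code placed in sec may still be called after init memory is freed. */
bool demo_survives_boot(enum demo_section sec, bool memory_hotplug)
{
	switch (sec) {
	case DEMO_SEC_TEXT:
		return true;
	case DEMO_SEC_MEMINIT:
		return memory_hotplug;
	case DEMO_SEC_INIT:
	default:
		return false;
	}
}
```

With the toggle set to 1 and memory hotplug enabled, both annotations map to a section that is still present when kswapd runs, which is the property the deferred initialisation code needs. With the toggle set to 0, the DEMO_DEFER_INIT code goes back to the init section and is discarded after boot.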
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f179.google.com (mail-ig0-f179.google.com [209.85.213.179]) by kanga.kvack.org (Postfix) with ESMTP id DF0FC6B0075 for ; Tue, 28 Apr 2015 12:02:53 -0400 (EDT) Received: by igblo3 with SMTP id lo3so93526545igb.1 for ; Tue, 28 Apr 2015 09:02:53 -0700 (PDT) Received: from relay.sgi.com (relay2.sgi.com. [192.48.180.65]) by mx.google.com with ESMTP id mv8si8775571igb.62.2015.04.28.09.02.52 for ; Tue, 28 Apr 2015 09:02:52 -0700 (PDT) Message-ID: <553FAF26.9060609@sgi.com> Date: Tue, 28 Apr 2015 11:02:46 -0500 From: nzimmer MIME-Version: 1.0 Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function. References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-3-git-send-email-mgorman@suse.de> <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> <20150428082831.GI2449@suse.de> In-Reply-To: <20150428082831.GI2449@suse.de> Content-Type: text/plain; charset="iso-8859-15"; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , Andrew Morton Cc: Linux-MM , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML This is the one I have, but I haven't had a chance to talk with him in a long time. robinmholt@gmail.com On 04/28/2015 03:28 AM, Mel Gorman wrote: > On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote: >> On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman wrote: >> >>> From: Robin Holt >> : : host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1 >> : : Recipient address rejected: User unknown in virtual alias >> : table (in reply to RCPT TO command) >> >> Has Robin moved, or is SGI mail busted? > Robin has moved and I do not have an updated address for him. The > address used in the patches was the one he posted the patches with. > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by kanga.kvack.org (Postfix) with ESMTP id B5C0B6B006E for ; Tue, 28 Apr 2015 18:41:02 -0400 (EDT) Received: by pabtp1 with SMTP id tp1so8846568pab.2 for ; Tue, 28 Apr 2015 15:41:02 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id ir4si36518068pbc.118.2015.04.28.15.41.01 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 15:41:01 -0700 (PDT) Date: Tue, 28 Apr 2015 15:41:00 -0700 From: Andrew Morton Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function. Message-Id: <20150428154100.0f6bd333620b2e744ee66221@linux-foundation.org> In-Reply-To: <20150428082831.GI2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-3-git-send-email-mgorman@suse.de> <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> <20150428082831.GI2449@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML On Tue, 28 Apr 2015 09:28:31 +0100 Mel Gorman wrote: > On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote: > > On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman wrote: > > > > > From: Robin Holt > > > > : : host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1 > > : : Recipient address rejected: User unknown in virtual alias > > : table (in reply to RCPT TO command) > > > > Has Robin moved, or is SGI mail busted? > > Robin has moved and I do not have an updated address for him. The > address used in the patches was the one he posted the patches with. 
> As Nathan mentioned, z:/usr/src/git26> git log | grep "Robin Holt" Cc: Robin Holt Acked-by: Robin Holt Cc: Robin Holt Cc: Robin Holt Cc: Robin Holt -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f170.google.com (mail-wi0-f170.google.com [209.85.212.170]) by kanga.kvack.org (Postfix) with ESMTP id 976406B0032 for ; Tue, 28 Apr 2015 19:05:12 -0400 (EDT) Received: by wizk4 with SMTP id k4so159065692wiz.1 for ; Tue, 28 Apr 2015 16:05:12 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id dj6si20449431wib.22.2015.04.28.16.05.10 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 28 Apr 2015 16:05:11 -0700 (PDT) Date: Wed, 29 Apr 2015 00:05:06 +0100 From: Mel Gorman Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function. 
Message-ID: <20150428230506.GP2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-3-git-send-email-mgorman@suse.de> <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> <20150428082831.GI2449@suse.de> <20150428154100.0f6bd333620b2e744ee66221@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150428154100.0f6bd333620b2e744ee66221@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Robin Holt , LKML On Tue, Apr 28, 2015 at 03:41:00PM -0700, Andrew Morton wrote: > On Tue, 28 Apr 2015 09:28:31 +0100 Mel Gorman wrote: > > > On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote: > > > On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman wrote: > > > > > > > From: Robin Holt > > > > > > : : host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1 > > > : : Recipient address rejected: User unknown in virtual alias > > > : table (in reply to RCPT TO command) > > > > > > Has Robin moved, or is SGI mail busted? > > > > Robin has moved and I do not have an updated address for him. The > > address used in the patches was the one he posted the patches with. > > > > As Nathan mentioned, > > z:/usr/src/git26> git log | grep "Robin Holt" > Cc: Robin Holt > Acked-by: Robin Holt > Cc: Robin Holt > Cc: Robin Holt > Cc: Robin Holt I can update the address if Robin wishes (cc'd). I was preserving the address that was used to actually sign off the patches as that was the history. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f176.google.com (mail-ig0-f176.google.com [209.85.213.176]) by kanga.kvack.org (Postfix) with ESMTP id D834F6B0032 for ; Tue, 28 Apr 2015 21:31:53 -0400 (EDT) Received: by igbhj9 with SMTP id hj9so37178085igb.1 for ; Tue, 28 Apr 2015 18:31:53 -0700 (PDT) Received: from g2t2354.austin.hp.com (g2t2354.austin.hp.com. [15.217.128.53]) by mx.google.com with ESMTPS id ka10si10043017igb.53.2015.04.28.18.31.53 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 28 Apr 2015 18:31:53 -0700 (PDT) Message-ID: <55403484.8060906@hp.com> Date: Tue, 28 Apr 2015 21:31:48 -0400 From: Waiman Long MIME-Version: 1.0 Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3 References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429804437.24139.3@cpanel21.proisp.no> In-Reply-To: <1429804437.24139.3@cpanel21.proisp.no> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Daniel J Blueman Cc: Mel Gorman , Linux-MM , Nathan Zimmer , Dave Hansen , Scott Norton , Andrew Morton , LKML , 'Steffen Persvold' On 04/23/2015 11:53 AM, Daniel J Blueman wrote: > On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman wrote: >> The big change here is an adjustment to the topology_init path that >> caused >> soft lockups on Waiman and Daniel Blue had reported it was an expensive >> function. >> >> Changelog since v2 >> o Reduce overhead of topology_init >> o Remove boot-time kernel parameter to enable/disable >> o Enable on UMA >> >> Changelog since v1 >> o Always initialise low zones >> o Typo corrections >> o Rename parallel mem init to parallel struct page init >> o Rebase to 4.0 > [] > > Splendid work! On this 256c setup, topology_init now takes 185ms. > > This brings the kernel boot time down to 324s [1]. 
It turns out that > one memset is responsible for most of the time setting up the PUDs > and PMDs; adapting memset to use non-temporal writes [3] avoids > generating RMW cycles, bringing boot time down to 186s [2]. > > If this is a possibility, I can split this patch and map other arch's > memset_nocache to memset, or change the callsite as preferred; > comments welcome. > > Thanks, > Daniel > > [1] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2.txt > [2] > https://resources.numascale.com/telemetry/defermem/h8qgl-defer2-nontemporal.txt > > -- [3] > > From f822139736cab8434302693c635fa146b465273c Mon Sep 17 00:00:00 2001 > From: Daniel J Blueman > Date: Thu, 23 Apr 2015 23:26:27 +0800 > Subject: [RFC] Speedup PMD setup > > Using non-temporal writes prevents read-modify-write cycles, > which are much slower over large topologies. > > Adapt the existing memset() function into a _nocache variant and use > when setting up PMDs during early boot to reduce boot time. > > Signed-off-by: Daniel J Blueman > --- > arch/x86/include/asm/string_64.h | 3 ++ > arch/x86/lib/memset_64.S | 90 > ++++++++++++++++++++++++++++++++++++++++ > mm/memblock.c | 2 +- > 3 files changed, 94 insertions(+), 1 deletion(-) > I tried your patch on my 12-TB IvyBridge-EX test machine and the bootup time increased from 265s to 289s (24s increase). I think my IvyBridge-EX box was using the optimized memset_c_e (rep stosb) code which turned out to perform better than the non-temporal move in your code. I think that may be due to the temporal moves that need to be done at the beginning and end of the memory range. I had tried to replace clear_page() with non-temporal moves. I generally got only a few percentage points of improvement compared with the optimized clear_page_c() and clear_page_c_e() code. That is not a lot. Anyway, I think the AMD box that you used wasn't setting the X86_FEATURE_REP_GOOD or X86_FEATURE_ERMS bits resulting in poor memset performance. 
If such a feature is supported in the AMD CPU (albeit in a different way), you may consider sending in a patch to set those feature bits. Alternatively, you will need to duplicate the alternative instruction stuff in your memset_nocache() to make sure that it can use the optimized code, if appropriate. Cheers, Longman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933613AbbDWKdZ (ORCPT ); Thu, 23 Apr 2015 06:33:25 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49891 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932910AbbDWKdU (ORCPT ); Thu, 23 Apr 2015 06:33:20 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 0/13] Parallel struct page initialisation v3 Date: Thu, 23 Apr 2015 11:33:03 +0100 Message-Id: <1429785196-7668-1-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The big change here is an adjustment to the topology_init path that caused soft lockups on Waiman and Daniel Blue had reported it was an expensive function. Changelog since v2 o Reduce overhead of topology_init o Remove boot-time kernel parameter to enable/disable o Enable on UMA Changelog since v1 o Always initialise low zones o Typo corrections o Rename parallel mem init to parallel struct page init o Rebase to 4.0 Struct page initialisation had been identified as one of the reasons why large machines take a long time to boot. Patches were posted a long time ago to defer initialisation until they were first used. 
This was rejected on the grounds it should not be necessary to hurt the fast paths. This series reuses much of the work from that time but defers the initialisation of memory to kswapd so that one thread per node initialises memory local to that node. After applying the series and setting the appropriate Kconfig variable I see this in the boot log on a 64G machine [ 7.383764] kswapd 0 initialised deferred memory in 188ms [ 7.404253] kswapd 1 initialised deferred memory in 208ms [ 7.411044] kswapd 3 initialised deferred memory in 216ms [ 7.411551] kswapd 2 initialised deferred memory in 216ms On a 1TB machine, I see [ 8.406511] kswapd 3 initialised deferred memory in 1116ms [ 8.428518] kswapd 1 initialised deferred memory in 1140ms [ 8.435977] kswapd 0 initialised deferred memory in 1148ms [ 8.437416] kswapd 2 initialised deferred memory in 1148ms Once booted the machine appears to work as normal. Boot times were measured from the time shutdown was called until ssh was available again. In the 64G case, the boot time savings are negligible. On the 1TB machine, the savings were 16 seconds. It would be nice if the people that have access to really large machines would test this series and report how much boot time is reduced. 
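The scheme described in this cover letter, where boot initialises only part of each node and a per-node pass finishes the rest, can be sketched in userspace C. This is a rough illustration under stated assumptions rather than the code in the series: every demo_* name and all of the sizes are hypothetical, and the per-node passes are plain functions to be called one after another, standing in for the per-node kswapd threads.

```c
#define DEMO_NR_NODES       4      /* hypothetical number of NUMA nodes */
#define DEMO_PAGES_PER_NODE 1024UL /* hypothetical pages per node */
#define DEMO_EARLY_PAGES    128UL  /* pages per node initialised at boot */

struct demo_page {
	int initialised;
};

struct demo_node {
	struct demo_page pages[DEMO_PAGES_PER_NODE];
	unsigned long first_deferred; /* first page index left for the deferred pass */
};

struct demo_node demo_nodes[DEMO_NR_NODES];

/* Early boot: initialise only a prefix of every node's pages. */
void demo_early_init(void)
{
	for (int nid = 0; nid < DEMO_NR_NODES; nid++) {
		struct demo_node *node = &demo_nodes[nid];
		unsigned long i;

		for (i = 0; i < DEMO_EARLY_PAGES; i++)
			node->pages[i].initialised = 1;
		node->first_deferred = i;
	}
}

/*
 * Deferred pass for one node. It touches only that node's pages, which is
 * what would let one thread per node run without sharing state.
 * Returns the number of pages it initialised.
 */
unsigned long demo_deferred_init_node(int nid)
{
	struct demo_node *node = &demo_nodes[nid];
	unsigned long done = 0;

	for (unsigned long i = node->first_deferred; i < DEMO_PAGES_PER_NODE; i++) {
		node->pages[i].initialised = 1;
		done++;
	}
	node->first_deferred = DEMO_PAGES_PER_NODE;
	return done;
}

/* Count initialised pages across all nodes. */
unsigned long demo_count_initialised(void)
{
	unsigned long total = 0;

	for (int nid = 0; nid < DEMO_NR_NODES; nid++)
		for (unsigned long i = 0; i < DEMO_PAGES_PER_NODE; i++)
			total += demo_nodes[nid].pages[i].initialised;
	return total;
}
```

Calling demo_early_init() and then demo_deferred_init_node() once per node leaves every page initialised. Because each deferred pass reads and writes only its own node's state, the passes are independent of one another, which is the property the series relies on when it hands the work to one thread per node.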
arch/ia64/mm/numa.c | 19 +-- arch/x86/Kconfig | 1 + drivers/base/node.c | 11 +- include/linux/memblock.h | 18 +++ include/linux/mm.h | 8 +- include/linux/mmzone.h | 23 ++- mm/Kconfig | 18 +++ mm/bootmem.c | 8 +- mm/internal.h | 23 ++- mm/memblock.c | 34 ++++- mm/mm_init.c | 9 +- mm/nobootmem.c | 7 +- mm/page_alloc.c | 379 ++++++++++++++++++++++++++++++++++++++++------- mm/vmscan.c | 6 +- 14 files changed, 462 insertions(+), 102 deletions(-) -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933695AbbDWKd0 (ORCPT ); Thu, 23 Apr 2015 06:33:26 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49890 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933106AbbDWKdU (ORCPT ); Thu, 23 Apr 2015 06:33:20 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 01/13] memblock: Introduce a for_each_reserved_mem_region iterator. Date: Thu, 23 Apr 2015 11:33:04 +0100 Message-Id: <1429785196-7668-2-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Robin Holt As part of initializing struct page's in 2MiB chunks, we noticed that at the end of free_all_bootmem(), there was nothing which had forced the reserved/allocated 4KiB pages to be initialized. This helper function will be used for that expansion. 
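For readers unfamiliar with this iterator style, the following is a minimal user-space sketch of a cursor-based loop that ends on a sentinel index, which is the general pattern a for_each_reserved_mem_region() helper would follow. Everything in it is an assumption made for illustration: `struct region`, the `reserved` table, `next_reserved()`, `for_each_reserved()` and `total_reserved_bytes()` are hypothetical stand-ins, not the kernel's memblock types or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a reserved region: base address and size in bytes. */
struct region {
	uint64_t base;
	uint64_t size;
};

/* Hypothetical reserved-region table used only by this sketch. */
static const struct region reserved[] = {
	{ 0x1000, 0x2000 },
	{ 0x8000, 0x1000 },
};
static const uint64_t reserved_cnt = sizeof(reserved) / sizeof(reserved[0]);

/* Report region *idx and advance, or set *idx to UINT64_MAX when exhausted. */
static void next_reserved(uint64_t *idx, uint64_t *out_start, uint64_t *out_end)
{
	if (*idx < reserved_cnt) {
		const struct region *r = &reserved[*idx];

		if (out_start)
			*out_start = r->base;
		if (out_end)
			*out_end = r->base + r->size - 1;	/* inclusive end */
		*idx += 1;
		return;
	}

	/* signal end of iteration */
	*idx = UINT64_MAX;
}

/* Loop macro: the first region is fetched in the init clause of the for. */
#define for_each_reserved(i, p_start, p_end)			\
	for (i = 0, next_reserved(&i, p_start, p_end);		\
	     i != UINT64_MAX;					\
	     next_reserved(&i, p_start, p_end))

/* Sum the bytes covered by every reserved region. */
static uint64_t total_reserved_bytes(void)
{
	uint64_t i, start, end, total = 0;

	for_each_reserved(i, &start, &end) {
		assert(end >= start);
		total += end - start + 1;
	}
	return total;
}
```

Because the end address is inclusive in this sketch, a caller that wants a byte count has to add one, as total_reserved_bytes() does.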
Signed-off-by: Robin Holt Signed-off-by: Nate Zimmer Signed-off-by: Mel Gorman --- include/linux/memblock.h | 18 ++++++++++++++++++ mm/memblock.c | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index e8cc45307f8f..3075e7673c54 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -93,6 +93,9 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a, struct memblock_type *type_b, phys_addr_t *out_start, phys_addr_t *out_end, int *out_nid); +void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start, + phys_addr_t *out_end); + /** * for_each_mem_range - iterate through memblock areas from type_a and not * included in type_b. Or just type_a if type_b is NULL. @@ -132,6 +135,21 @@ void __next_mem_range_rev(u64 *idx, int nid, struct memblock_type *type_a, __next_mem_range_rev(&i, nid, type_a, type_b, \ p_start, p_end, p_nid)) +/** + * for_each_reserved_mem_region - iterate over all reserved memblock areas + * @i: u64 used as loop variable + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL + * + * Walks over reserved areas of memblock. Available as soon as memblock + * is initialized. 
+ */ +#define for_each_reserved_mem_region(i, p_start, p_end) \ + for (i = 0UL, \ + __next_reserved_mem_region(&i, p_start, p_end); \ + i != (u64)ULLONG_MAX; \ + __next_reserved_mem_region(&i, p_start, p_end)) + #ifdef CONFIG_MOVABLE_NODE static inline bool memblock_is_hotpluggable(struct memblock_region *m) { diff --git a/mm/memblock.c b/mm/memblock.c index 252b77bdf65e..e0cc2d174f74 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -765,6 +765,38 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size) } /** + * __next_reserved_mem_region - next function for for_each_reserved_region() + * @idx: pointer to u64 loop variable + * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL + * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL + * + * Iterate over all reserved memory regions. + */ +void __init_memblock __next_reserved_mem_region(u64 *idx, + phys_addr_t *out_start, + phys_addr_t *out_end) +{ + struct memblock_type *rsv = &memblock.reserved; + + if (*idx >= 0 && *idx < rsv->cnt) { + struct memblock_region *r = &rsv->regions[*idx]; + phys_addr_t base = r->base; + phys_addr_t size = r->size; + + if (out_start) + *out_start = base; + if (out_end) + *out_end = base + size - 1; + + *idx += 1; + return; + } + + /* signal end of iteration */ + *idx = ULLONG_MAX; +} + +/** * __next__mem_range - next function for for_each_free_mem_range() etc. 
* @idx: pointer to u64 loop variable * @nid: node selector, %NUMA_NO_NODE for all nodes -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933897AbbDWKgE (ORCPT ); Thu, 23 Apr 2015 06:36:04 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49901 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933280AbbDWKdV (ORCPT ); Thu, 23 Apr 2015 06:33:21 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 02/13] mm: meminit: Move page initialization into a separate function. Date: Thu, 23 Apr 2015 11:33:05 +0100 Message-Id: <1429785196-7668-3-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Robin Holt Currently, memmap_init_zone() has all the smarts for initializing a single page. A subset of this is required for parallel page initialisation and so this patch breaks up the monolithic function in preparation. 
Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer Signed-off-by: Mel Gorman --- mm/page_alloc.c | 79 +++++++++++++++++++++++++++++++++------------------------ 1 file changed, 46 insertions(+), 33 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 40e29429e7b0..fd7a6d09062d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -778,6 +778,51 @@ static int free_tail_pages_check(struct page *head_page, struct page *page) return 0; } +static void __meminit __init_single_page(struct page *page, unsigned long pfn, + unsigned long zone, int nid) +{ + struct zone *z = &NODE_DATA(nid)->node_zones[zone]; + + set_page_links(page, zone, nid, pfn); + mminit_verify_page_links(page, zone, nid, pfn); + init_page_count(page); + page_mapcount_reset(page); + page_cpupid_reset_last(page); + SetPageReserved(page); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. + */ + if ((z->zone_start_pfn <= pfn) + && (pfn < zone_end_pfn(z)) + && !(pfn & (pageblock_nr_pages - 1))) + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + + INIT_LIST_HEAD(&page->lru); +#ifdef WANT_PAGE_VIRTUAL + /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ + if (!is_highmem_idx(zone)) + set_page_address(page, __va(pfn << PAGE_SHIFT)); +#endif +} + +static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone, + int nid) +{ + return __init_single_page(pfn_to_page(pfn), pfn, zone, nid); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { bool compound = PageCompound(page); @@ -4124,7 +4169,6 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { - struct page *page; unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; @@ -4145,38 +4189,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, if (!early_pfn_in_nid(pfn, nid)) continue; } - page = pfn_to_page(pfn); - set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); - init_page_count(page); - page_mapcount_reset(page); - page_cpupid_reset_last(page); - SetPageReserved(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - - INIT_LIST_HEAD(&page->lru); -#ifdef WANT_PAGE_VIRTUAL - /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ - if (!is_highmem_idx(zone)) - set_page_address(page, __va(pfn << PAGE_SHIFT)); -#endif + __init_single_pfn(pfn, zone, nid); } } -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965318AbbDWKgA (ORCPT ); Thu, 23 Apr 2015 06:36:00 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49912 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933196AbbDWKdV (ORCPT ); Thu, 23 Apr 2015 06:33:21 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 03/13] mm: meminit: Only set page reserved in the memblock region Date: Thu, 23 Apr 2015 11:33:06 +0100 Message-Id: <1429785196-7668-4-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Nathan Zimmer Currently each struct page is marked as reserved when it is initialised. This patch changes that to start with the reserved bit clear and then set the bit only for pages in the reserved regions. Signed-off-by: Robin Holt Signed-off-by: Nathan Zimmer --- include/linux/mm.h | 2 ++ mm/nobootmem.c | 3 +++ mm/page_alloc.c | 11 ++++++++++- 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 47a93928b90f..b6f82a31028a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1711,6 +1711,8 @@ extern void free_highmem_page(struct page *page); extern void adjust_managed_page_count(struct page *page, long count); extern void mem_init_print_info(const char *str); +extern void reserve_bootmem_region(unsigned long start, unsigned long end); + /* Free the reserved page into the buddy system, so it gets managed.
*/ static inline void __free_reserved_page(struct page *page) { diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 90b50468333e..396f9e450dc1 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -121,6 +121,9 @@ static unsigned long __init free_low_memory_core_early(void) memblock_clear_hotplug(0, -1); + for_each_reserved_mem_region(i, &start, &end) + reserve_bootmem_region(start, end); + for_each_free_mem_range(i, NUMA_NO_NODE, &start, &end, NULL) count += __free_memory_core(start, end); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fd7a6d09062d..2abb3b861e70 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -788,7 +788,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn, init_page_count(page); page_mapcount_reset(page); page_cpupid_reset_last(page); - SetPageReserved(page); /* * Mark the block movable so that blocks are reserved for @@ -823,6 +822,16 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone, return __init_single_page(pfn_to_page(pfn), pfn, zone, nid); } +void reserve_bootmem_region(unsigned long start, unsigned long end) +{ + unsigned long start_pfn = PFN_DOWN(start); + unsigned long end_pfn = PFN_UP(end); + + for (; start_pfn < end_pfn; start_pfn++) + if (pfn_valid(start_pfn)) + SetPageReserved(pfn_to_page(start_pfn)); +} + static bool free_pages_prepare(struct page *page, unsigned int order) { bool compound = PageCompound(page); -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965262AbbDWKf5 (ORCPT ); Thu, 23 Apr 2015 06:35:57 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49921 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933444AbbDWKdW (ORCPT ); Thu, 23 Apr 2015 06:33:22 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 04/13] mm: 
page_alloc: Pass PFN to __free_pages_bootmem Date: Thu, 23 Apr 2015 11:33:07 +0100 Message-Id: <1429785196-7668-5-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org __free_pages_bootmem prepares a page for release to the buddy allocator and assumes that the struct page is initialised. Parallel initialisation of struct pages defers initialisation and __free_pages_bootmem can be called for struct pages that cannot yet map struct page to PFN. This patch passes PFN to __free_pages_bootmem with no other functional change. Signed-off-by: Mel Gorman --- mm/bootmem.c | 8 ++++---- mm/internal.h | 3 ++- mm/memblock.c | 2 +- mm/nobootmem.c | 4 ++-- mm/page_alloc.c | 3 ++- 5 files changed, 11 insertions(+), 9 deletions(-) diff --git a/mm/bootmem.c b/mm/bootmem.c index 477be696511d..daf956bb4782 100644 --- a/mm/bootmem.c +++ b/mm/bootmem.c @@ -164,7 +164,7 @@ void __init free_bootmem_late(unsigned long physaddr, unsigned long size) end = PFN_DOWN(physaddr + size); for (; cursor < end; cursor++) { - __free_pages_bootmem(pfn_to_page(cursor), 0); + __free_pages_bootmem(pfn_to_page(cursor), cursor, 0); totalram_pages++; } } @@ -210,7 +210,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata) if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL) { int order = ilog2(BITS_PER_LONG); - __free_pages_bootmem(pfn_to_page(start), order); + __free_pages_bootmem(pfn_to_page(start), start, order); count += BITS_PER_LONG; start += BITS_PER_LONG; } else { @@ -220,7 +220,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata) while (vec && cur != start) { if (vec & 1) { page = pfn_to_page(cur); - __free_pages_bootmem(page, 0); + __free_pages_bootmem(page, cur, 0); count++; } vec >>= 1; @@ -234,7 +234,7 @@ static 
unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata) pages = bootmem_bootmap_pages(pages); count += pages; while (pages--) - __free_pages_bootmem(page++, 0); + __free_pages_bootmem(page++, cur++, 0); bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count); diff --git a/mm/internal.h b/mm/internal.h index a96da5b0029d..76b605139c7a 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -155,7 +155,8 @@ __find_buddy_index(unsigned long page_idx, unsigned int order) } extern int __isolate_free_page(struct page *page, unsigned int order); -extern void __free_pages_bootmem(struct page *page, unsigned int order); +extern void __free_pages_bootmem(struct page *page, unsigned long pfn, + unsigned int order); extern void prep_compound_page(struct page *page, unsigned long order); #ifdef CONFIG_MEMORY_FAILURE extern bool is_free_buddy_page(struct page *page); diff --git a/mm/memblock.c b/mm/memblock.c index e0cc2d174f74..f3e97d8eeb5c 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1334,7 +1334,7 @@ void __init __memblock_free_late(phys_addr_t base, phys_addr_t size) end = PFN_DOWN(base + size); for (; cursor < end; cursor++) { - __free_pages_bootmem(pfn_to_page(cursor), 0); + __free_pages_bootmem(pfn_to_page(cursor), cursor, 0); totalram_pages++; } } diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 396f9e450dc1..bae652713ee5 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -77,7 +77,7 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size) end = PFN_DOWN(addr + size); for (; cursor < end; cursor++) { - __free_pages_bootmem(pfn_to_page(cursor), 0); + __free_pages_bootmem(pfn_to_page(cursor), cursor, 0); totalram_pages++; } } @@ -92,7 +92,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end) while (start + (1UL << order) > end) order--; - __free_pages_bootmem(pfn_to_page(start), order); + __free_pages_bootmem(pfn_to_page(start), start, order); start += (1UL << order); } diff --git a/mm/page_alloc.c 
b/mm/page_alloc.c index 2abb3b861e70..0a0e0f280d87 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -886,7 +886,8 @@ static void __free_pages_ok(struct page *page, unsigned int order) local_irq_restore(flags); } -void __init __free_pages_bootmem(struct page *page, unsigned int order) +void __init __free_pages_bootmem(struct page *page, unsigned long pfn, + unsigned int order) { unsigned int nr_pages = 1 << order; struct page *p = page; -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965220AbbDWKfx (ORCPT ); Thu, 23 Apr 2015 06:35:53 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49925 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933486AbbDWKdX (ORCPT ); Thu, 23 Apr 2015 06:33:23 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 05/13] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid Date: Thu, 23 Apr 2015 11:33:08 +0100 Message-Id: <1429785196-7668-6-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org __early_pfn_to_nid() in the generic and arch-specific implementations use static variables to cache recent lookups. Without the cache boot times are much higher due to the excessive memblock lookups but it assumes that memory initialisation is single-threaded. Parallel initialisation of struct pages will break that assumption so this patch makes __early_pfn_to_nid() SMP-safe by requiring the caller to cache recent search information. early_pfn_to_nid() keeps the same interface but is only safe to use early in boot due to the use of a global static variable. 
meminit_pfn_in_nid() is an SMP-safe version that callers must maintain their own state for. Signed-off-by: Mel Gorman --- arch/ia64/mm/numa.c | 19 +++++++------------ include/linux/mm.h | 6 ++++-- include/linux/mmzone.h | 16 +++++++++++++++- mm/page_alloc.c | 40 +++++++++++++++++++++++++--------------- 4 files changed, 51 insertions(+), 30 deletions(-) diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c index ea21d4cad540..aa19b7ac8222 100644 --- a/arch/ia64/mm/numa.c +++ b/arch/ia64/mm/numa.c @@ -58,27 +58,22 @@ paddr_to_nid(unsigned long paddr) * SPARSEMEM to allocate the SPARSEMEM sectionmap on the NUMA node where * the section resides. */ -int __meminit __early_pfn_to_nid(unsigned long pfn) +int __meminit __early_pfn_to_nid(unsigned long pfn, + struct mminit_pfnnid_cache *state) { int i, section = pfn >> PFN_SECTION_SHIFT, ssec, esec; - /* - * NOTE: The following SMP-unsafe globals are only used early in boot - * when the kernel is running single-threaded. - */ - static int __meminitdata last_ssec, last_esec; - static int __meminitdata last_nid; - if (section >= last_ssec && section < last_esec) - return last_nid; + if (section >= state->last_start && section < state->last_end) + return state->last_nid; for (i = 0; i < num_node_memblks; i++) { ssec = node_memblk[i].start_paddr >> PA_SECTION_SHIFT; esec = (node_memblk[i].start_paddr + node_memblk[i].size + ((1L << PA_SECTION_SHIFT) - 1)) >> PA_SECTION_SHIFT; if (section >= ssec && section < esec) { - last_ssec = ssec; - last_esec = esec; - last_nid = node_memblk[i].nid; + state->last_start = ssec; + state->last_end = esec; + state->last_nid = node_memblk[i].nid; return node_memblk[i].nid; } } diff --git a/include/linux/mm.h b/include/linux/mm.h index b6f82a31028a..a8a8b161fd65 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1802,7 +1802,8 @@ extern void sparse_memory_present_with_active_regions(int nid); #if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \ !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) 
-static inline int __early_pfn_to_nid(unsigned long pfn) +static inline int __early_pfn_to_nid(unsigned long pfn, + struct mminit_pfnnid_cache *state) { return 0; } @@ -1810,7 +1811,8 @@ static inline int __early_pfn_to_nid(unsigned long pfn) /* please see mm/page_alloc.c */ extern int __meminit early_pfn_to_nid(unsigned long pfn); /* there is a per-arch backend function. */ -extern int __meminit __early_pfn_to_nid(unsigned long pfn); +extern int __meminit __early_pfn_to_nid(unsigned long pfn, + struct mminit_pfnnid_cache *state); #endif extern void set_dma_reserve(unsigned long new_dma_reserve); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 2782df47101e..a67b33e52dfe 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1216,10 +1216,24 @@ void sparse_init(void); #define sparse_index_init(_sec, _nid) do {} while (0) #endif /* CONFIG_SPARSEMEM */ +/* + * During memory init memblocks map pfns to nids. The search is expensive and + * this caches recent lookups. The implementation of __early_pfn_to_nid + * may treat start/end as pfns or sections. + */ +struct mminit_pfnnid_cache { + unsigned long last_start; + unsigned long last_end; + int last_nid; +}; + #ifdef CONFIG_NODES_SPAN_OTHER_NODES bool early_pfn_in_nid(unsigned long pfn, int nid); +bool meminit_pfn_in_nid(unsigned long pfn, int node, + struct mminit_pfnnid_cache *state); #else -#define early_pfn_in_nid(pfn, nid) (1) +#define early_pfn_in_nid(pfn, nid) (1) +#define meminit_pfn_in_nid(pfn, nid, state) (1) #endif #ifndef early_pfn_valid diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0a0e0f280d87..f556ed63b964 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4457,39 +4457,41 @@ int __meminit init_currently_empty_zone(struct zone *zone, #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID + /* * Required by SPARSEMEM. Given a PFN, return what node the PFN is on. 
*/ -int __meminit __early_pfn_to_nid(unsigned long pfn) +int __meminit __early_pfn_to_nid(unsigned long pfn, + struct mminit_pfnnid_cache *state) { unsigned long start_pfn, end_pfn; int nid; - /* - * NOTE: The following SMP-unsafe globals are only used early in boot - * when the kernel is running single-threaded. - */ - static unsigned long __meminitdata last_start_pfn, last_end_pfn; - static int __meminitdata last_nid; - if (last_start_pfn <= pfn && pfn < last_end_pfn) - return last_nid; + if (state->last_start <= pfn && pfn < state->last_end) + return state->last_nid; nid = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn); if (nid != -1) { - last_start_pfn = start_pfn; - last_end_pfn = end_pfn; - last_nid = nid; + state->last_start = start_pfn; + state->last_end = end_pfn; + state->last_nid = nid; } return nid; } #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */ +struct __meminitdata mminit_pfnnid_cache global_init_state; + +/* Only safe to use early in boot when initialisation is single-threaded */ int __meminit early_pfn_to_nid(unsigned long pfn) { int nid; - nid = __early_pfn_to_nid(pfn); + /* The system will behave unpredictably otherwise */ + BUG_ON(system_state != SYSTEM_BOOTING); + + nid = __early_pfn_to_nid(pfn, &global_init_state); if (nid >= 0) return nid; /* just returns 0 */ @@ -4497,15 +4499,23 @@ int __meminit early_pfn_to_nid(unsigned long pfn) } #ifdef CONFIG_NODES_SPAN_OTHER_NODES -bool __meminit early_pfn_in_nid(unsigned long pfn, int node) +bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node, + struct mminit_pfnnid_cache *state) { int nid; - nid = __early_pfn_to_nid(pfn); + nid = __early_pfn_to_nid(pfn, state); if (nid >= 0 && nid != node) return false; return true; } + +/* Only safe to use early in boot when initialisation is single-threaded */ +bool __meminit early_pfn_in_nid(unsigned long pfn, int node) +{ + return meminit_pfn_in_nid(pfn, node, &global_init_state); +} + #endif /** -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 
Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965169AbbDWKfu (ORCPT ); Thu, 23 Apr 2015 06:35:50 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49934 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933511AbbDWKdY (ORCPT ); Thu, 23 Apr 2015 06:33:24 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 06/13] mm: meminit: Inline some helper functions Date: Thu, 23 Apr 2015 11:33:09 +0100 Message-Id: <1429785196-7668-7-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org early_pfn_in_nid() and meminit_pfn_in_nid() are small functions that are unnecessarily visible outside memory initialisation. As well as unnecessary visibility, it's unnecessary function call overhead when initialising pages. This patch moves the helpers inline. 
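The difference between the two helpers named above comes down to where the lookup cache lives. The following is a minimal user-space sketch of a caller-owned cache, assuming a small hypothetical `node_map` table in place of memblock; `struct pfnnid_cache`, `pfn_to_nid_cached()` and `pfn_in_nid()` are illustrative names, not the kernel functions. Each thread that passes its own cache can look up PFNs without sharing mutable state with other threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node map: each entry covers the PFN range [start, end). */
struct pfn_range {
	unsigned long start;
	unsigned long end;
	int nid;
};

static const struct pfn_range node_map[] = {
	{    0, 1000, 0 },
	{ 1000, 3000, 1 },
	{ 3000, 4000, 0 },
};

/* Caller-owned cache of the most recent successful lookup. */
struct pfnnid_cache {
	unsigned long last_start;
	unsigned long last_end;
	int last_nid;
};

/* Return the node that owns pfn, or -1 if no range covers it. */
static int pfn_to_nid_cached(unsigned long pfn, struct pfnnid_cache *state)
{
	size_t i;

	assert(state != NULL);

	/* Fast path: same range as the previous lookup made with this cache. */
	if (state->last_start <= pfn && pfn < state->last_end)
		return state->last_nid;

	/* Slow path: linear search, then remember the range that matched. */
	for (i = 0; i < sizeof(node_map) / sizeof(node_map[0]); i++) {
		if (node_map[i].start <= pfn && pfn < node_map[i].end) {
			state->last_start = node_map[i].start;
			state->last_end = node_map[i].end;
			state->last_nid = node_map[i].nid;
			return node_map[i].nid;
		}
	}
	return -1;
}

/* True unless pfn is known to belong to a node other than the one given. */
static bool pfn_in_nid(unsigned long pfn, int node, struct pfnnid_cache *state)
{
	int nid = pfn_to_nid_cached(pfn, state);

	return !(nid >= 0 && nid != node);
}
```

A zero-initialised cache describes an empty range, so the first lookup always takes the slow path. A wrapper that passes one shared global cache would reproduce the single-threaded, boot-only variant.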
Signed-off-by: Mel Gorman --- include/linux/mmzone.h | 9 ------ mm/page_alloc.c | 75 +++++++++++++++++++++++++------------------------- 2 files changed, 38 insertions(+), 46 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a67b33e52dfe..e3d8a2bd8d78 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1227,15 +1227,6 @@ struct mminit_pfnnid_cache { int last_nid; }; -#ifdef CONFIG_NODES_SPAN_OTHER_NODES -bool early_pfn_in_nid(unsigned long pfn, int nid); -bool meminit_pfn_in_nid(unsigned long pfn, int node, - struct mminit_pfnnid_cache *state); -#else -#define early_pfn_in_nid(pfn, nid) (1) -#define meminit_pfn_in_nid(pfn, nid, state) (1) -#endif - #ifndef early_pfn_valid #define early_pfn_valid(pfn) (1) #endif diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f556ed63b964..8b4659aa0bc2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -907,6 +907,44 @@ void __init __free_pages_bootmem(struct page *page, unsigned long pfn, __free_pages(page, order); } +#if defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) || \ + defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) +/* Only safe to use early in boot when initialisation is single-threaded */ +struct __meminitdata mminit_pfnnid_cache global_init_state; +int __meminit early_pfn_to_nid(unsigned long pfn) +{ + int nid; + + /* The system will behave unpredictably otherwise */ + BUG_ON(system_state != SYSTEM_BOOTING); + + nid = __early_pfn_to_nid(pfn, &global_init_state); + if (nid >= 0) + return nid; + /* just returns 0 */ + return 0; +} +#endif + +#ifdef CONFIG_NODES_SPAN_OTHER_NODES +static inline bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node, + struct mminit_pfnnid_cache *state) +{ + int nid; + + nid = __early_pfn_to_nid(pfn, state); + if (nid >= 0 && nid != node) + return false; + return true; +} + +/* Only safe to use early in boot when initialisation is single-threaded */ +static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node) +{ + return 
meminit_pfn_in_nid(pfn, node, &global_init_state); +} +#endif + #ifdef CONFIG_CMA /* Free whole pageblock and set its migration type to MIGRATE_CMA. */ void __init init_cma_reserved_pageblock(struct page *page) @@ -4481,43 +4519,6 @@ int __meminit __early_pfn_to_nid(unsigned long pfn, } #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */ -struct __meminitdata mminit_pfnnid_cache global_init_state; - -/* Only safe to use early in boot when initialisation is single-threaded */ -int __meminit early_pfn_to_nid(unsigned long pfn) -{ - int nid; - - /* The system will behave unpredictably otherwise */ - BUG_ON(system_state != SYSTEM_BOOTING); - - nid = __early_pfn_to_nid(pfn, &global_init_state); - if (nid >= 0) - return nid; - /* just returns 0 */ - return 0; -} - -#ifdef CONFIG_NODES_SPAN_OTHER_NODES -bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node, - struct mminit_pfnnid_cache *state) -{ - int nid; - - nid = __early_pfn_to_nid(pfn, state); - if (nid >= 0 && nid != node) - return false; - return true; -} - -/* Only safe to use early in boot when initialisation is single-threaded */ -bool __meminit early_pfn_in_nid(unsigned long pfn, int node) -{ - return meminit_pfn_in_nid(pfn, node, &global_init_state); -} - -#endif - /** * free_bootmem_with_active_regions - Call memblock_free_early_nid for each active range * @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed. 
-- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965105AbbDWKfW (ORCPT ); Thu, 23 Apr 2015 06:35:22 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49921 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933559AbbDWKdZ (ORCPT ); Thu, 23 Apr 2015 06:33:25 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set Date: Thu, 23 Apr 2015 11:33:10 +0100 Message-Id: <1429785196-7668-8-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This patch initialises all low memory struct pages and 2G of the highest zone on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. That config option cannot be set yet but will be made available in a later patch. Parallel initialisation of struct page depends on some features from memory hotplug and it is necessary to alter section annotations.
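The split between pages initialised at boot and pages left for later can be illustrated with a small user-space sketch. It assumes 4K pages and uses hypothetical names (`struct node_sketch`, `init_page_early()`, `count_early_pages()`); it is not the helper this patch adds, which may apply further conditions such as alignment of the cut-off PFN. The rule shown is the one stated above: zones below the node's highest zone are always initialised, and the highest zone is initialised up to a 2G budget, after which the first skipped PFN is recorded in first_deferred_pfn.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4K pages */
#define SKETCH_EARLY_PAGES	(2UL << (30 - SKETCH_PAGE_SHIFT))	/* 2G worth of pages */

/* Hypothetical per-node state, loosely modelled on the fields discussed above. */
struct node_sketch {
	unsigned long node_end_pfn;		/* one past the last PFN of the node */
	unsigned long first_deferred_pfn;	/* ULONG_MAX means nothing is deferred */
};

/* Decide whether the struct page for pfn should be initialised during early boot. */
static bool init_page_early(struct node_sketch *node, unsigned long pfn,
			    unsigned long zone_end_pfn,
			    unsigned long *nr_initialised)
{
	/* Zones that end before the node does are low zones: always initialise. */
	if (zone_end_pfn < node->node_end_pfn)
		return true;

	/* Highest zone: stop once the 2G budget is used and record where. */
	if (*nr_initialised >= SKETCH_EARLY_PAGES) {
		if (node->first_deferred_pfn == ULONG_MAX)
			node->first_deferred_pfn = pfn;
		return false;
	}

	(*nr_initialised)++;
	return true;
}

/* Walk one zone and count how many of its pages would be initialised early. */
static unsigned long count_early_pages(struct node_sketch *node,
				       unsigned long zone_start,
				       unsigned long zone_end)
{
	unsigned long pfn, nr_initialised = 0, early = 0;

	assert(zone_start <= zone_end);
	for (pfn = zone_start; pfn < zone_end; pfn++) {
		if (!init_page_early(node, pfn, zone_end, &nr_initialised))
			break;
		early++;
	}
	return early;
}
```

With a 16G node whose first 4G form a low zone, this sketch initialises the whole low zone plus 524288 pages of the highest zone and leaves the remainder deferred.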
Signed-off-by: Mel Gorman --- drivers/base/node.c | 11 +++++-- include/linux/mmzone.h | 8 ++++++ mm/Kconfig | 18 ++++++++++++ mm/internal.h | 8 ++++++ mm/page_alloc.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++-- 5 files changed, 117 insertions(+), 6 deletions(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index 36fabe43cd44..d03e976b4431 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -361,12 +361,16 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid) #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE #define page_initialized(page) (page->lru.next) -static int get_nid_for_pfn(unsigned long pfn) +static int get_nid_for_pfn(struct pglist_data *pgdat, unsigned long pfn) { struct page *page; if (!pfn_valid_within(pfn)) return -1; +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT + if (pgdat && pfn >= pgdat->first_deferred_pfn) + return early_pfn_to_nid(pfn); +#endif page = pfn_to_page(pfn); if (!page_initialized(page)) return -1; @@ -378,6 +382,7 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) { int ret; unsigned long pfn, sect_start_pfn, sect_end_pfn; + struct pglist_data *pgdat = NODE_DATA(nid); if (!mem_blk) return -EFAULT; @@ -390,7 +395,7 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) { int page_nid; - page_nid = get_nid_for_pfn(pfn); + page_nid = get_nid_for_pfn(pgdat, pfn); if (page_nid < 0) continue; if (page_nid != nid) @@ -429,7 +434,7 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) { int nid; - nid = get_nid_for_pfn(pfn); + nid = get_nid_for_pfn(NULL, pfn); if (nid < 0) continue; if (!node_online(nid)) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e3d8a2bd8d78..4882c53b70b5 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -762,6 +762,14 @@ typedef struct pglist_data { /* Number of pages migrated during the 
rate limiting time interval */ unsigned long numabalancing_migrate_nr_pages; #endif + +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT + /* + * If memory initialisation on large machines is deferred then this + * is the first PFN that needs to be initialised. + */ + unsigned long first_deferred_pfn; +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ } pg_data_t; #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) diff --git a/mm/Kconfig b/mm/Kconfig index a03131b6ba8e..3e40cb64e226 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -629,3 +629,21 @@ config MAX_STACK_SIZE_MB changed to a smaller value in which case that is used. A sane initial value is 80 MB. + +# For architectures that support deferred memory initialisation +config ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT + bool + +config DEFERRED_STRUCT_PAGE_INIT + bool "Defer initialisation of struct pages to kswapd" + default n + depends on ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT + depends on MEMORY_HOTPLUG + help + Ordinarily all struct pages are initialised during early boot in a + single thread. On very large machines this can take a considerable + amount of time. If this option is set, large machines will bring up + a subset of memmap at boot and then initialise the rest in parallel + when kswapd starts. This has a potential performance impact on + processes running early in the lifetime of the systemm until kswapd + finishes the initialisation. 
diff --git a/mm/internal.h b/mm/internal.h index 76b605139c7a..4a73f74846bd 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -385,6 +385,14 @@ static inline void mminit_verify_zonelist(void) } #endif /* CONFIG_DEBUG_MEMORY_INIT */ +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +#define __defermem_init __meminit +#define __defer_init __meminit +#else +#define __defermem_init +#define __defer_init __init +#endif + /* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */ #if defined(CONFIG_SPARSEMEM) extern void mminit_validate_memmodel_limits(unsigned long *start_pfn, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8b4659aa0bc2..c7c2d20c8bb5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -235,6 +235,64 @@ EXPORT_SYMBOL(nr_online_nodes); int page_group_by_mobility_disabled __read_mostly; +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +static inline void reset_deferred_meminit(pg_data_t *pgdat) +{ + pgdat->first_deferred_pfn = ULONG_MAX; +} + +/* Returns true if the struct page for the pfn is uninitialised */ +static inline bool __defermem_init early_page_uninitialised(unsigned long pfn) +{ + int nid = early_pfn_to_nid(pfn); + + if (pfn >= NODE_DATA(nid)->first_deferred_pfn) + return true; + + return false; +} + +/* + * Returns false when the remaining initialisation should be deferred until + * later in the boot cycle when it can be parallelised. 
+ */ +static inline bool update_defer_init(pg_data_t *pgdat, + unsigned long pfn, unsigned long zone_end, + unsigned long *nr_initialised) +{ + /* Always populate low zones for address-contrained allocations */ + if (zone_end < pgdat_end_pfn(pgdat)) + return true; + + /* Initialise at least 2G of the highest zone */ + (*nr_initialised)++; + if (*nr_initialised > (2UL << (30 - PAGE_SHIFT)) && + (pfn & (PAGES_PER_SECTION - 1)) == 0) { + pgdat->first_deferred_pfn = pfn; + return false; + } + + return true; +} +#else +static inline void reset_deferred_meminit(pg_data_t *pgdat) +{ +} + +static inline bool early_page_uninitialised(unsigned long pfn) +{ + return false; +} + +static inline bool update_defer_init(pg_data_t *pgdat, + unsigned long pfn, unsigned long zone_end, + unsigned long *nr_initialised) +{ + return true; +} +#endif + + void set_pageblock_migratetype(struct page *page, int migratetype) { if (unlikely(page_group_by_mobility_disabled && @@ -886,8 +944,8 @@ static void __free_pages_ok(struct page *page, unsigned int order) local_irq_restore(flags); } -void __init __free_pages_bootmem(struct page *page, unsigned long pfn, - unsigned int order) +static void __defer_init __free_pages_boot_core(struct page *page, + unsigned long pfn, unsigned int order) { unsigned int nr_pages = 1 << order; struct page *p = page; @@ -945,6 +1003,14 @@ static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node) } #endif +void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn, + unsigned int order) +{ + if (early_page_uninitialised(pfn)) + return; + return __free_pages_boot_core(page, pfn, order); +} + #ifdef CONFIG_CMA /* Free whole pageblock and set its migration type to MIGRATE_CMA. 
*/ void __init init_cma_reserved_pageblock(struct page *page) @@ -4217,14 +4283,16 @@ static void setup_zone_migrate_reserve(struct zone *zone) void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long start_pfn, enum memmap_context context) { + pg_data_t *pgdat = NODE_DATA(nid); unsigned long end_pfn = start_pfn + size; unsigned long pfn; struct zone *z; + unsigned long nr_initialised = 0; if (highest_memmap_pfn < end_pfn - 1) highest_memmap_pfn = end_pfn - 1; - z = &NODE_DATA(nid)->node_zones[zone]; + z = &pgdat->node_zones[zone]; for (pfn = start_pfn; pfn < end_pfn; pfn++) { /* * There can be holes in boot-time mem_map[]s @@ -4236,6 +4304,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, continue; if (!early_pfn_in_nid(pfn, nid)) continue; + if (!update_defer_init(pgdat, pfn, end_pfn, + &nr_initialised)) + break; } __init_single_pfn(pfn, zone, nid); } @@ -5037,6 +5108,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size, /* pg_data_t should be reset to zero when it's allocated */ WARN_ON(pgdat->nr_zones || pgdat->classzone_idx); + reset_deferred_meminit(pgdat); pgdat->node_id = nid; pgdat->node_start_pfn = node_start_pfn; #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965024AbbDWKfF (ORCPT ); Thu, 23 Apr 2015 06:35:05 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49925 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932910AbbDWKdZ (ORCPT ); Thu, 23 Apr 2015 06:33:25 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 08/13] mm: meminit: Initialise remaining struct pages in parallel with kswapd Date: Thu, 23 Apr 2015 11:33:11 +0100 Message-Id: 
<1429785196-7668-9-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Only a subset of struct pages are initialised at the moment. When this patch is applied kswapd initialises the remaining struct pages in parallel. This should boot faster by spreading the work to multiple CPUs and initialising data that is local to the CPU. The user-visible effect on large machines is that free memory will appear to rapidly increase early in the lifetime of the system until kswapd reports that all memory is initialised in the kernel log. Once initialised there should be no other user-visible effects. Signed-off-by: Mel Gorman --- mm/internal.h | 6 +++ mm/mm_init.c | 1 + mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-- mm/vmscan.c | 6 ++- 4 files changed, 123 insertions(+), 6 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 4a73f74846bd..2c4057140bec 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -388,9 +388,15 @@ static inline void mminit_verify_zonelist(void) #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT #define __defermem_init __meminit #define __defer_init __meminit + +void deferred_init_memmap(int nid); #else #define __defermem_init #define __defer_init __init + +static inline void deferred_init_memmap(int nid) +{ +} #endif /* mminit_validate_memmodel_limits is independent of CONFIG_DEBUG_MEMORY_INIT */ diff --git a/mm/mm_init.c b/mm/mm_init.c index 5f420f7fafa1..28fbf87b20aa 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -11,6 +11,7 @@ #include #include #include +#include #include "internal.h" #ifdef CONFIG_DEBUG_MEMORY_INIT diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c7c2d20c8bb5..f2db3d7aa6cb 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -252,6 +252,14 @@ static inline bool
__defermem_init early_page_uninitialised(unsigned long pfn) return false; } +static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid) +{ + if (pfn >= NODE_DATA(nid)->first_deferred_pfn) + return true; + + return false; +} + /* * Returns false when the remaining initialisation should be deferred until * later in the boot cycle when it can be parallelised. @@ -284,6 +292,11 @@ static inline bool early_page_uninitialised(unsigned long pfn) return false; } +static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid) +{ + return false; +} + static inline bool update_defer_init(pg_data_t *pgdat, unsigned long pfn, unsigned long zone_end, unsigned long *nr_initialised) @@ -880,14 +893,45 @@ static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone, return __init_single_page(pfn_to_page(pfn), pfn, zone, nid); } -void reserve_bootmem_region(unsigned long start, unsigned long end) +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +static void init_reserved_page(unsigned long pfn) +{ + pg_data_t *pgdat; + int nid, zid; + + if (!early_page_uninitialised(pfn)) + return; + + nid = early_pfn_to_nid(pfn); + pgdat = NODE_DATA(nid); + + for (zid = 0; zid < MAX_NR_ZONES; zid++) { + struct zone *zone = &pgdat->node_zones[zid]; + + if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone)) + break; + } + __init_single_pfn(pfn, zid, nid); +} +#else +static inline void init_reserved_page(unsigned long pfn) +{ +} +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ + +void __meminit reserve_bootmem_region(unsigned long start, unsigned long end) { unsigned long start_pfn = PFN_DOWN(start); unsigned long end_pfn = PFN_UP(end); - for (; start_pfn < end_pfn; start_pfn++) - if (pfn_valid(start_pfn)) - SetPageReserved(pfn_to_page(start_pfn)); + for (; start_pfn < end_pfn; start_pfn++) { + if (pfn_valid(start_pfn)) { + struct page *page = pfn_to_page(start_pfn); + + init_reserved_page(start_pfn); + SetPageReserved(page); + } + } } static bool 
free_pages_prepare(struct page *page, unsigned int order) @@ -1011,6 +1055,67 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn, return __free_pages_boot_core(page, pfn, order); } +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +/* Initialise remaining memory on a node */ +void __defermem_init deferred_init_memmap(int nid) +{ + unsigned long start = jiffies; + struct mminit_pfnnid_cache nid_init_state = { }; + + pg_data_t *pgdat = NODE_DATA(nid); + int zid; + unsigned long first_init_pfn = pgdat->first_deferred_pfn; + + if (first_init_pfn == ULONG_MAX) + return; + + /* Sanity check boundaries */ + BUG_ON(pgdat->first_deferred_pfn < pgdat->node_start_pfn); + BUG_ON(pgdat->first_deferred_pfn > pgdat_end_pfn(pgdat)); + pgdat->first_deferred_pfn = ULONG_MAX; + + for (zid = 0; zid < MAX_NR_ZONES; zid++) { + struct zone *zone = pgdat->node_zones + zid; + unsigned long walk_start, walk_end; + int i; + + for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) { + unsigned long pfn, end_pfn; + + end_pfn = min(walk_end, zone_end_pfn(zone)); + pfn = first_init_pfn; + if (pfn < walk_start) + pfn = walk_start; + if (pfn < zone->zone_start_pfn) + pfn = zone->zone_start_pfn; + + for (; pfn < end_pfn; pfn++) { + struct page *page; + + if (!pfn_valid(pfn)) + continue; + + if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) + continue; + + if (page->flags) { + VM_BUG_ON(page_zone(page) != zone); + continue; + } + + __init_single_page(page, pfn, zid, nid); + __free_pages_boot_core(page, pfn, 0); + cond_resched(); + } + first_init_pfn = max(end_pfn, first_init_pfn); + } + } + + pr_info("kswapd %d initialised deferred memory in %ums\n", nid, + jiffies_to_msecs(jiffies - start)); +} +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ + #ifdef CONFIG_CMA /* Free whole pageblock and set its migration type to MIGRATE_CMA. 
*/ void __init init_cma_reserved_pageblock(struct page *page) @@ -4221,6 +4326,9 @@ static void setup_zone_migrate_reserve(struct zone *zone) zone->nr_migrate_reserve_block = reserve; for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) { + if (!early_page_nid_uninitialised(pfn, zone_to_nid(zone))) + return; + if (!pfn_valid(pfn)) continue; page = pfn_to_page(pfn); diff --git a/mm/vmscan.c b/mm/vmscan.c index 5e8eadd71bac..c4895d26d036 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3348,7 +3348,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx) * If there are applications that are active memory-allocators * (most normal use), this basically shouldn't matter. */ -static int kswapd(void *p) +static int __defermem_init kswapd(void *p) { unsigned long order, new_order; unsigned balanced_order; @@ -3383,6 +3383,8 @@ static int kswapd(void *p) tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; set_freezable(); + deferred_init_memmap(pgdat->node_id); + order = new_order = 0; balanced_order = 0; classzone_idx = new_classzone_idx = pgdat->nr_zones - 1; @@ -3538,7 +3540,7 @@ static int cpu_callback(struct notifier_block *nfb, unsigned long action, * This kswapd start function will be called by init and node-hot-add. * On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added. 
*/ -int kswapd_run(int nid) +int __defermem_init kswapd_run(int nid) { pg_data_t *pgdat = NODE_DATA(nid); int ret = 0; -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933772AbbDWKdd (ORCPT ); Thu, 23 Apr 2015 06:33:33 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49934 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933657AbbDWKd0 (ORCPT ); Thu, 23 Apr 2015 06:33:26 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 09/13] mm: meminit: Minimise number of pfn->page lookups during initialisation Date: Thu, 23 Apr 2015 11:33:12 +0100 Message-Id: <1429785196-7668-10-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Deferred struct page initialisation is using pfn_to_page() on every PFN unnecessarily. This patch minimises the number of lookups and scheduler checks. 
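To make the saving concrete, here is a small userspace model of the idea, offered as a sketch rather than the kernel implementation: the struct page pointer is advanced with page++ inside a MAX_ORDER-aligned block and only looked up again at a block boundary. The names page_model, memmap, model_pfn_to_page() and walk_range() are hypothetical, the flat array stands in for the real memmap, and MAX_ORDER_NR_PAGES is assumed to be 1024.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration: MAX_ORDER of 11, so blocks of 1 << 10 pages. */
#define MAX_ORDER_NR_PAGES	1024UL

/* Hypothetical, minimal stand-in for struct page. */
struct page_model {
	unsigned long flags;
};

/* Toy flat memmap covering four MAX_ORDER blocks. */
static struct page_model memmap[4 * MAX_ORDER_NR_PAGES];

/* Number of full lookups performed by the most recent walk. */
static unsigned long nr_lookups;

/* Stand-in for pfn_to_page() that counts how often it is called. */
static struct page_model *model_pfn_to_page(unsigned long pfn)
{
	nr_lookups++;
	return &memmap[pfn];
}

/* Walk [pfn, end_pfn) and return the number of pages visited. */
static unsigned long walk_range(unsigned long pfn, unsigned long end_pfn)
{
	struct page_model *page = NULL;
	unsigned long visited = 0;

	nr_lookups = 0;
	for (; pfn < end_pfn; pfn++) {
		/* Within a block the next struct page is simply adjacent. */
		if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0)
			page++;
		else
			page = model_pfn_to_page(pfn);

		/* The cheap increment must agree with a full lookup. */
		assert(page == &memmap[pfn]);
		visited++;
	}
	return visited;
}
```

In this model a walk over four aligned blocks needs four lookups instead of 4096. In the patch itself the same branch also carries the cond_resched() call, so scheduler checks drop by the same factor.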
Signed-off-by: Mel Gorman --- mm/page_alloc.c | 29 ++++++++++++++++++++++++----- 1 file changed, 24 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f2db3d7aa6cb..11125634e375 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1081,6 +1081,7 @@ void __defermem_init deferred_init_memmap(int nid) for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) { unsigned long pfn, end_pfn; + struct page *page = NULL; end_pfn = min(walk_end, zone_end_pfn(zone)); pfn = first_init_pfn; @@ -1090,13 +1091,32 @@ void __defermem_init deferred_init_memmap(int nid) pfn = zone->zone_start_pfn; for (; pfn < end_pfn; pfn++) { - struct page *page; - - if (!pfn_valid(pfn)) + if (!pfn_valid_within(pfn)) continue; - if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) + /* + * Ensure pfn_valid is checked every + * MAX_ORDER_NR_PAGES for memory holes + */ + if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) { + if (!pfn_valid(pfn)) { + page = NULL; + continue; + } + } + + if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) { + page = NULL; continue; + } + + /* Minimise pfn page lookups and scheduler checks */ + if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) { + page++; + } else { + page = pfn_to_page(pfn); + cond_resched(); + } if (page->flags) { VM_BUG_ON(page_zone(page) != zone); @@ -1105,7 +1125,6 @@ void __defermem_init deferred_init_memmap(int nid) __init_single_page(page, pfn, zid, nid); __free_pages_boot_core(page, pfn, 0); - cond_resched(); } first_init_pfn = max(end_pfn, first_init_pfn); } -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964907AbbDWKeE (ORCPT ); Thu, 23 Apr 2015 06:34:04 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49921 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933703AbbDWKd1 (ORCPT ); Thu, 23 Apr 2015 06:33:27 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , 
Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 10/13] x86: mm: Enable deferred struct page initialisation on x86-64 Date: Thu, 23 Apr 2015 11:33:13 +0100 Message-Id: <1429785196-7668-11-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Subject says it all. Other architectures may enable on a case-by-case basis after auditing early_pfn_to_nid and testing. Signed-off-by: Mel Gorman --- arch/x86/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index b7d31ca55187..1beff8a8fbc9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -18,6 +18,7 @@ config X86_64 select X86_DEV_DMA_OPS select ARCH_USE_CMPXCHG_LOCKREF select HAVE_LIVEPATCH + select ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT ### Arch settings config X86 -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964837AbbDWKeB (ORCPT ); Thu, 23 Apr 2015 06:34:01 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49925 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933106AbbDWKd2 (ORCPT ); Thu, 23 Apr 2015 06:33:28 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 11/13] mm: meminit: Free pages in large chunks where possible Date: Thu, 23 Apr 2015 11:33:14 +0100 Message-Id: <1429785196-7668-12-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: 
linux-kernel@vger.kernel.org Parallel struct page initialisation frees pages one at a time. Try to free pages as single large pages where possible. Signed-off-by: Mel Gorman --- mm/page_alloc.c | 46 +++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 41 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 11125634e375..73077dc63f0c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1056,6 +1056,20 @@ void __defer_init __free_pages_bootmem(struct page *page, unsigned long pfn, } #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +void __defermem_init deferred_free_range(struct page *page, unsigned long pfn, + int nr_pages) +{ + int i; + + if (nr_pages == MAX_ORDER_NR_PAGES && (pfn & (MAX_ORDER_NR_PAGES-1)) == 0) { + __free_pages_boot_core(page, pfn, MAX_ORDER-1); + return; + } + + for (i = 0; i < nr_pages; i++, page++, pfn++) + __free_pages_boot_core(page, pfn, 0); +} + /* Initialise remaining memory on a node */ void __defermem_init deferred_init_memmap(int nid) { @@ -1082,6 +1096,9 @@ void __defermem_init deferred_init_memmap(int nid) for_each_mem_pfn_range(i, nid, &walk_start, &walk_end, NULL) { unsigned long pfn, end_pfn; struct page *page = NULL; + struct page *free_base_page = NULL; + unsigned long free_base_pfn = 0; + int nr_to_free = 0; end_pfn = min(walk_end, zone_end_pfn(zone)); pfn = first_init_pfn; @@ -1092,7 +1109,7 @@ void __defermem_init deferred_init_memmap(int nid) for (; pfn < end_pfn; pfn++) { if (!pfn_valid_within(pfn)) - continue; + goto free_range; /* * Ensure pfn_valid is checked every @@ -1101,30 +1118,49 @@ void __defermem_init deferred_init_memmap(int nid) if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) { if (!pfn_valid(pfn)) { page = NULL; - continue; + goto free_range; } } if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) { page = NULL; - continue; + goto free_range; } /* Minimise pfn page lookups and scheduler checks */ if (page && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) { page++; } else { +
deferred_free_range(free_base_page, + free_base_pfn, nr_to_free); + free_base_page = NULL; + free_base_pfn = nr_to_free = 0; + page = pfn_to_page(pfn); cond_resched(); } if (page->flags) { VM_BUG_ON(page_zone(page) != zone); - continue; + goto free_range; } __init_single_page(page, pfn, zid, nid); - __free_pages_boot_core(page, pfn, 0); + if (!free_base_page) { + free_base_page = page; + free_base_pfn = pfn; + nr_to_free = 0; + } + nr_to_free++; + + /* Where possible, batch up pages for a single free */ + continue; +free_range: + /* Free the current block of pages to allocator */ + if (free_base_page) + deferred_free_range(free_base_page, free_base_pfn, nr_to_free); + free_base_page = NULL; + free_base_pfn = nr_to_free = 0; } first_init_pfn = max(end_pfn, first_init_pfn); } -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933875AbbDWKeA (ORCPT ); Thu, 23 Apr 2015 06:34:00 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49984 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933705AbbDWKd3 (ORCPT ); Thu, 23 Apr 2015 06:33:29 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 12/13] mm: meminit: Reduce number of times pageblocks are set during struct page init Date: Thu, 23 Apr 2015 11:33:15 +0100 Message-Id: <1429785196-7668-13-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org During parallel struct page initialisation, ranges are checked for every PFN unnecessarily which increases boot times. This patch alters when the ranges are checked.
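For a sense of scale, the sketch below counts how often the migratetype would be set if only the first PFN of each pageblock triggers an update, which is the alignment test this patch leaves in memmap_init_zone(). It is a userspace model under stated assumptions: struct zone_model and count_migratetype_updates() are hypothetical names, and a pageblock is assumed to span 512 pages (2M with 4K pages).

```c
#include <assert.h>

/* Assumed for illustration: order-9 pageblocks, i.e. 512 pages each. */
#define PAGEBLOCK_NR_PAGES	512UL

/* Hypothetical zone model: just the PFN span. */
struct zone_model {
	unsigned long start_pfn;
	unsigned long end_pfn;		/* one past the last PFN */
};

/* Count the migratetype updates a walk over the zone would perform. */
static unsigned long count_migratetype_updates(const struct zone_model *z)
{
	unsigned long pfn, updates = 0;

	assert(z->start_pfn <= z->end_pfn);
	for (pfn = z->start_pfn; pfn < z->end_pfn; pfn++) {
		/* Only a pageblock-aligned PFN sets the block's type. */
		if (!(pfn & (PAGEBLOCK_NR_PAGES - 1)))
			updates++;
	}
	return updates;
}
```

Under these assumptions a zone of 2048 pages starting at PFN 0 sees four updates, and a span that contains no pageblock-aligned PFN sees none.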
Signed-off-by: Mel Gorman --- mm/page_alloc.c | 45 +++++++++++++++++++++++---------------------- 1 file changed, 23 insertions(+), 22 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 73077dc63f0c..576b03bc9057 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -852,33 +852,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page) static void __meminit __init_single_page(struct page *page, unsigned long pfn, unsigned long zone, int nid) { - struct zone *z = &NODE_DATA(nid)->node_zones[zone]; - set_page_links(page, zone, nid, pfn); mminit_verify_page_links(page, zone, nid, pfn); init_page_count(page); page_mapcount_reset(page); page_cpupid_reset_last(page); - /* - * Mark the block movable so that blocks are reserved for - * movable at startup. This will force kernel allocations - * to reserve their blocks rather than leaking throughout - * the address space during boot when many long-lived - * kernel allocations are made. Later some blocks near - * the start are marked MIGRATE_RESERVE by - * setup_zone_migrate_reserve() - * - * bitmap is created for zone's valid pfn range. but memmap - * can be created for invalid pages (for alignment) - * check here not to call set_pageblock_migratetype() against - * pfn out of zone. - */ - if ((z->zone_start_pfn <= pfn) - && (pfn < zone_end_pfn(z)) - && !(pfn & (pageblock_nr_pages - 1))) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - INIT_LIST_HEAD(&page->lru); #ifdef WANT_PAGE_VIRTUAL /* The shift won't overflow because ZONE_NORMAL is below 4G. 
*/ @@ -1062,6 +1041,7 @@ void __defermem_init deferred_free_range(struct page *page, unsigned long pfn, int i; if (nr_pages == MAX_ORDER_NR_PAGES && (pfn & (MAX_ORDER_NR_PAGES-1)) == 0) { + set_pageblock_migratetype(page, MIGRATE_MOVABLE); __free_pages_boot_core(page, pfn, MAX_ORDER-1); return; } @@ -4471,7 +4451,28 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, &nr_initialised)) break; } - __init_single_pfn(pfn, zone, nid); + + /* + * Mark the block movable so that blocks are reserved for + * movable at startup. This will force kernel allocations + * to reserve their blocks rather than leaking throughout + * the address space during boot when many long-lived + * kernel allocations are made. Later some blocks near + * the start are marked MIGRATE_RESERVE by + * setup_zone_migrate_reserve() + * + * bitmap is created for zone's valid pfn range. but memmap + * can be created for invalid pages (for alignment) + * check here not to call set_pageblock_migratetype() against + * pfn out of zone. 
+ */ + if (!(pfn & (pageblock_nr_pages - 1))) { + struct page *page = pfn_to_page(pfn); + set_pageblock_migratetype(page, MIGRATE_MOVABLE); + __init_single_page(page, pfn, zone, nid); + } else { + __init_single_pfn(pfn, zone, nid); + } } } -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933834AbbDWKde (ORCPT ); Thu, 23 Apr 2015 06:33:34 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49997 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933709AbbDWKda (ORCPT ); Thu, 23 Apr 2015 06:33:30 -0400 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML , Mel Gorman Subject: [PATCH 13/13] mm: meminit: Remove mminit_verify_page_links Date: Thu, 23 Apr 2015 11:33:16 +0100 Message-Id: <1429785196-7668-14-git-send-email-mgorman@suse.de> X-Mailer: git-send-email 2.3.5 In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org mminit_verify_page_links() is an extremely paranoid check that was introduced when memory initialisation was being heavily reworked. Profiles indicated that up to 10% of parallel memory initialisation was spent on checking this for every page. The cost could be reduced but in practice this check only found problems very early during the initialisation rewrite and has found nothing since. This patch removes an expensive unnecessary check. 
Signed-off-by: Mel Gorman --- mm/internal.h | 8 -------- mm/mm_init.c | 8 -------- mm/page_alloc.c | 1 - 3 files changed, 17 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 2c4057140bec..c73ad248f8f4 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -360,10 +360,7 @@ do { \ } while (0) extern void mminit_verify_pageflags_layout(void); -extern void mminit_verify_page_links(struct page *page, - enum zone_type zone, unsigned long nid, unsigned long pfn); extern void mminit_verify_zonelist(void); - #else static inline void mminit_dprintk(enum mminit_level level, @@ -375,11 +372,6 @@ static inline void mminit_verify_pageflags_layout(void) { } -static inline void mminit_verify_page_links(struct page *page, - enum zone_type zone, unsigned long nid, unsigned long pfn) -{ -} - static inline void mminit_verify_zonelist(void) { } diff --git a/mm/mm_init.c b/mm/mm_init.c index 28fbf87b20aa..fdadf918de76 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -131,14 +131,6 @@ void __init mminit_verify_pageflags_layout(void) BUG_ON(or_mask != add_mask); } -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone, - unsigned long nid, unsigned long pfn) -{ - BUG_ON(page_to_nid(page) != nid); - BUG_ON(page_zonenum(page) != zone); - BUG_ON(page_to_pfn(page) != pfn); -} - static __init int set_mminit_loglevel(char *str) { get_option(&str, &mminit_loglevel); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 576b03bc9057..739b1840de2c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -853,7 +853,6 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn, unsigned long zone, int nid) { set_page_links(page, zone, nid, pfn); - mminit_verify_page_links(page, zone, nid, pfn); init_page_count(page); page_mapcount_reset(page); page_cpupid_reset_last(page); -- 2.3.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030238AbbDWPyS (ORCPT ); Thu, 23 Apr 2015 
11:54:18 -0400 Received: from numascale.com ([213.162.240.84]:59813 "EHLO numascale.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S966258AbbDWPyQ (ORCPT ); Thu, 23 Apr 2015 11:54:16 -0400 Date: Thu, 23 Apr 2015 23:53:57 +0800 From: Daniel J Blueman Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3 To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Andrew Morton , LKML , Mel Gorman , "'Steffen Persvold'" Message-Id: <1429804437.24139.3@cpanel21.proisp.no> In-Reply-To: <1429785196-7668-1-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> X-Mailer: geary/0.8.3 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - cpanel21.proisp.no X-AntiAbuse: Original Domain - vger.kernel.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - numascale.com X-Get-Message-Sender-Via: cpanel21.proisp.no: authenticated_id: daniel@numascale.com X-Source: X-Source-Args: X-Source-Dir: Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman wrote: > The big change here is an adjustment to the topology_init path that > caused > soft lockups on Waiman and Daniel Blue had reported it was an > expensive > function. > > Changelog since v2 > o Reduce overhead of topology_init > o Remove boot-time kernel parameter to enable/disable > o Enable on UMA > > Changelog since v1 > o Always initialise low zones > o Typo corrections > o Rename parallel mem init to parallel struct page init > o Rebase to 4.0 [] Splendid work! On this 256c setup, topology_init now takes 185ms. This brings the kernel boot time down to 324s [1]. 
It turns out that one memset is responsible for most of the time setting up the PUDs and PMDs; adapting memset to use non-temporal writes [3] avoids generating RMW cycles, bringing boot time down to 186s [2]. If this is a possibility, I can split this patch and map other architectures' memset_nocache to memset, or change the callsite as preferred; comments welcome. Thanks, Daniel [1] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2.txt [2] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2-nontemporal.txt -- [3] From f822139736cab8434302693c635fa146b465273c Mon Sep 17 00:00:00 2001 From: Daniel J Blueman Date: Thu, 23 Apr 2015 23:26:27 +0800 Subject: [RFC] Speedup PMD setup Using non-temporal writes prevents read-modify-write cycles, which are much slower over large topologies. Adapt the existing memset() function into a _nocache variant and use it when setting up PMDs during early boot to reduce boot time. Signed-off-by: Daniel J Blueman --- arch/x86/include/asm/string_64.h | 3 ++ arch/x86/lib/memset_64.S | 90 ++++++++++++++++++++++++++++++++++++++++ mm/memblock.c | 2 +- 3 files changed, 94 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h index e466119..1ef28d0 100644 --- a/arch/x86/include/asm/string_64.h +++ b/arch/x86/include/asm/string_64.h @@ -55,6 +55,8 @@ extern void *memcpy(void *to, const void *from, size_t len); #define __HAVE_ARCH_MEMSET void *memset(void *s, int c, size_t n); void *__memset(void *s, int c, size_t n); +void *memset_nocache(void *s, int c, size_t n); +void *__memset_nocache(void *s, int c, size_t n); #define __HAVE_ARCH_MEMMOVE void *memmove(void *dest, const void *src, size_t count); @@ -77,6 +79,7 @@ int strcmp(const char *cs, const char *ct); #define memcpy(dst, src, len) __memcpy(dst, src, len) #define memmove(dst, src, len) __memmove(dst, src, len) #define memset(s, c, n) __memset(s, c, n) +#define memset_nocache(s, c, n) __memset_nocache(s, c, n)
#endif #endif /* __KERNEL__ */ diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S index 6f44935..fb46f78 100644 --- a/arch/x86/lib/memset_64.S +++ b/arch/x86/lib/memset_64.S @@ -137,6 +137,96 @@ ENTRY(__memset) ENDPROC(memset) ENDPROC(__memset) +/* + * bzero_nocache - set a memory block to zero. This function uses + * non-temporal writes in the fastpath + * + * rdi destination + * rsi value (char) + * rdx count (bytes) + * + * rax original destination + */ + +ENTRY(memset_nocache) +ENTRY(__memset_nocache) + CFI_STARTPROC + movq %rdi,%r10 + + /* expand byte value */ + movzbl %sil,%ecx + movabs $0x0101010101010101,%rax + imulq %rcx,%rax + + /* align dst */ + movl %edi,%r9d + andl $7,%r9d + jnz bad_alignment + CFI_REMEMBER_STATE +after_bad_alignment: + + movq %rdx,%rcx + shrq $6,%rcx + jz handle_tail + + .p2align 4 +loop_64: + decq %rcx + movnti %rax,(%rdi) + movnti %rax,8(%rdi) + movnti %rax,16(%rdi) + movnti %rax,24(%rdi) + movnti %rax,32(%rdi) + movnti %rax,40(%rdi) + movnti %rax,48(%rdi) + movnti %rax,56(%rdi) + leaq 64(%rdi),%rdi + jnz loop_64 + + /* Handle tail in loops; the loops should be faster than hard + to predict jump tables */ + .p2align 4 +handle_tail: + movl %edx,%ecx + andl $63&(~7),%ecx + jz handle_7 + shrl $3,%ecx + .p2align 4 +loop_8: + decl %ecx + movnti %rax,(%rdi) + leaq 8(%rdi),%rdi + jnz loop_8 + +handle_7: + andl $7,%edx + jz ende + .p2align 4 +loop_1: + decl %edx + movb %al,(%rdi) + leaq 1(%rdi),%rdi + jnz loop_1 + +ende: + movq %r10,%rax + ret + + CFI_RESTORE_STATE +bad_alignment: + cmpq $7,%rdx + jbe handle_7 + movnti %rax,(%rdi) /* unaligned store */ + movq $8,%r8 + subq %r9,%r8 + addq %r8,%rdi + subq %r8,%rdx + jmp after_bad_alignment +final: + CFI_ENDPROC +ENDPROC(memset_nocache) +ENDPROC(__memset_nocache) + /* Some CPUs support enhanced REP MOVSB/STOSB feature. * It is recommended to use this when possible. 
* diff --git a/mm/memblock.c b/mm/memblock.c index f3e97d8..df434d2 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1212,7 +1212,7 @@ again: done: memblock_reserve(alloc, size); ptr = phys_to_virt(alloc); - memset(ptr, 0, size); + memset_nocache(ptr, 0, size); /* * The min_count is set to 0 so that bootmem allocated blocks From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030240AbbDWP4P (ORCPT ); Thu, 23 Apr 2015 11:56:15 -0400 Received: from cantor2.suse.de ([195.135.220.15]:59103 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757029AbbDWP4M (ORCPT ); Thu, 23 Apr 2015 11:56:12 -0400 Date: Thu, 23 Apr 2015 16:56:07 +0100 From: Mel Gorman To: Linux-MM Cc: Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Andrew Morton , LKML Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set Message-ID: <20150423155607.GA2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-8-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1429785196-7668-8-git-send-email-mgorman@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 23, 2015 at 11:33:10AM +0100, Mel Gorman wrote: > This patch initalises all low memory struct pages and 2G of the highest zone > on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT > is set. That config option cannot be set but will be available in a later > patch. Parallel initialisation of struct page depends on some features > from memory hotplug and it is necessary to alter alter section annotations. > > Signed-off-by: Mel Gorman I belatedly noticed that this causes section warnings. 
It'll be harmless for testing but the next (hopefully last) version will have this on top diff --git a/drivers/base/node.c b/drivers/base/node.c index d03e976b4431..97ab2c4dd39e 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -361,14 +361,14 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid) #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE #define page_initialized(page) (page->lru.next) -static int get_nid_for_pfn(struct pglist_data *pgdat, unsigned long pfn) +static int __init_refok get_nid_for_pfn(unsigned long pfn) { struct page *page; if (!pfn_valid_within(pfn)) return -1; #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT - if (pgdat && pfn >= pgdat->first_deferred_pfn) + if (system_state == SYSTEM_BOOTING) return early_pfn_to_nid(pfn); #endif page = pfn_to_page(pfn); @@ -382,7 +382,6 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) { int ret; unsigned long pfn, sect_start_pfn, sect_end_pfn; - struct pglist_data *pgdat = NODE_DATA(nid); if (!mem_blk) return -EFAULT; @@ -395,7 +394,7 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid) for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) { int page_nid; - page_nid = get_nid_for_pfn(pgdat, pfn); + page_nid = get_nid_for_pfn(pfn); if (page_nid < 0) continue; if (page_nid != nid) @@ -434,7 +433,7 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) { int nid; - nid = get_nid_for_pfn(NULL, pfn); + nid = get_nid_for_pfn(pfn); if (nid < 0) continue; if (!node_online(nid)) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966553AbbDWQas (ORCPT ); Thu, 23 Apr 2015 12:30:48 -0400 Received: from cantor2.suse.de ([195.135.220.15]:33112 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S966314AbbDWQap (ORCPT ); Thu, 23 Apr 2015 12:30:45 -0400 Date: Thu, 23 Apr 2015 17:30:39 +0100 From: Mel 
Gorman To: Daniel J Blueman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Andrew Morton , LKML , "'Steffen Persvold'" Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3 Message-ID: <20150423163039.GB2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429804437.24139.3@cpanel21.proisp.no> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <1429804437.24139.3@cpanel21.proisp.no> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 23, 2015 at 11:53:57PM +0800, Daniel J Blueman wrote: > On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman wrote: > >The big change here is an adjustment to the topology_init path > >that caused > >soft lockups on Waiman and Daniel Blue had reported it was an > >expensive > >function. > > > >Changelog since v2 > >o Reduce overhead of topology_init > >o Remove boot-time kernel parameter to enable/disable > >o Enable on UMA > > > >Changelog since v1 > >o Always initialise low zones > >o Typo corrections > >o Rename parallel mem init to parallel struct page init > >o Rebase to 4.0 > [] > > Splendid work! On this 256c setup, topology_init now takes 185ms. > > This brings the kernel boot time down to 324s [1]. Good stuff. Am I correct in thinking that the vanilla kernel takes 732s? > It turns out that > one memset is responsible for most of the time setting up the the > PUDs and PMDs; adapting memset to using non-temporal writes [3] > avoids generating RMW cycles, bringing boot time down to 186s [2]. > > If this is a possibility, I can split this patch and map other > arch's memset_nocache to memset, or change the callsite as > preferred; comments welcome. > In general, I see no problem with the patch, and it would be useful whether it goes in before or after this series. I would suggest you split this into three patches.
The first would be an asm-generic alias of memset_nocache to memset, with documentation saying that it is optional for an architecture to implement. The second would be your implementation for x86, which needs to go to the x86 maintainers. The third would then be the memblock.c change. Thanks. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031629AbbDXTsR (ORCPT ); Fri, 24 Apr 2015 15:48:17 -0400 Received: from g4t3426.houston.hp.com ([15.201.208.54]:54124 "EHLO g4t3426.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031121AbbDXTsP (ORCPT ); Fri, 24 Apr 2015 15:48:15 -0400 Message-ID: <553A9DFC.5040803@hp.com> Date: Fri, 24 Apr 2015 15:48:12 -0400 From: Waiman Long User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20130109 Thunderbird/10.0.12 MIME-Version: 1.0 To: Daniel J Blueman CC: Mel Gorman , Linux-MM , Nathan Zimmer , Dave Hansen , Scott Norton , Andrew Morton , LKML , "'Steffen Persvold'" Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3 References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429804437.24139.3@cpanel21.proisp.no> In-Reply-To: <1429804437.24139.3@cpanel21.proisp.no> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/23/2015 11:53 AM, Daniel J Blueman wrote: > On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman wrote: >> The big change here is an adjustment to the topology_init path that >> caused >> soft lockups on Waiman and Daniel Blue had reported it was an expensive >> function. 
>> >> Changelog since v2 >> o Reduce overhead of topology_init >> o Remove boot-time kernel parameter to enable/disable >> o Enable on UMA >> >> Changelog since v1 >> o Always initialise low zones >> o Typo corrections >> o Rename parallel mem init to parallel struct page init >> o Rebase to 4.0 > [] > > Splendid work! On this 256c setup, topology_init now takes 185ms. > > This brings the kernel boot time down to 324s [1]. It turns out that > one memset is responsible for most of the time setting up the the PUDs > and PMDs; adapting memset to using non-temporal writes [3] avoids > generating RMW cycles, bringing boot time down to 186s [2]. > > If this is a possibility, I can split this patch and map other arch's > memset_nocache to memset, or change the callsite as preferred; > comments welcome. > > Thanks, > Daniel > > [1] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2.txt > [2] > https://resources.numascale.com/telemetry/defermem/h8qgl-defer2-nontemporal.txt > > -- [3] > > From f822139736cab8434302693c635fa146b465273c Mon Sep 17 00:00:00 2001 > From: Daniel J Blueman > Date: Thu, 23 Apr 2015 23:26:27 +0800 > Subject: [RFC] Speedup PMD setup > > Using non-temporal writes prevents read-modify-write cycles, > which are much slower over large topologies. > > Adapt the existing memset() function into a _nocache variant and use > when setting up PMDs during early boot to reduce boot time. 
> > Signed-off-by: Daniel J Blueman > --- > arch/x86/include/asm/string_64.h | 3 ++ > arch/x86/lib/memset_64.S | 90 > ++++++++++++++++++++++++++++++++++++++++ > mm/memblock.c | 2 +- > 3 files changed, 94 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/string_64.h > b/arch/x86/include/asm/string_64.h > index e466119..1ef28d0 100644 > --- a/arch/x86/include/asm/string_64.h > +++ b/arch/x86/include/asm/string_64.h > @@ -55,6 +55,8 @@ extern void *memcpy(void *to, const void *from, > size_t len); > #define __HAVE_ARCH_MEMSET > void *memset(void *s, int c, size_t n); > void *__memset(void *s, int c, size_t n); > +void *memset_nocache(void *s, int c, size_t n); > +void *__memset_nocache(void *s, int c, size_t n); > > #define __HAVE_ARCH_MEMMOVE > void *memmove(void *dest, const void *src, size_t count); > @@ -77,6 +79,7 @@ int strcmp(const char *cs, const char *ct); > #define memcpy(dst, src, len) __memcpy(dst, src, len) > #define memmove(dst, src, len) __memmove(dst, src, len) > #define memset(s, c, n) __memset(s, c, n) > +#define memset_nocache(s, c, n) __memset_nocache(s, c, n) > #endif > > #endif /* __KERNEL__ */ > diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S > index 6f44935..fb46f78 100644 > --- a/arch/x86/lib/memset_64.S > +++ b/arch/x86/lib/memset_64.S > @@ -137,6 +137,96 @@ ENTRY(__memset) > ENDPROC(memset) > ENDPROC(__memset) > > +/* > + * bzero_nocache - set a memory block to zero. 
This function uses > + * non-temporal writes in the fastpath > + * > + * rdi destination > + * rsi value (char) > + * rdx count (bytes) > + * > + * rax original destination > + */ > + > +ENTRY(memset_nocache) > +ENTRY(__memset_nocache) > + CFI_STARTPROC > + movq %rdi,%r10 > + > + /* expand byte value */ > + movzbl %sil,%ecx > + movabs $0x0101010101010101,%rax > + imulq %rcx,%rax > + > + /* align dst */ > + movl %edi,%r9d > + andl $7,%r9d > + jnz bad_alignment > + CFI_REMEMBER_STATE > +after_bad_alignment: > + > + movq %rdx,%rcx > + shrq $6,%rcx > + jz handle_tail > + > + .p2align 4 > +loop_64: > + decq %rcx > + movnti %rax,(%rdi) > + movnti %rax,8(%rdi) > + movnti %rax,16(%rdi) > + movnti %rax,24(%rdi) > + movnti %rax,32(%rdi) > + movnti %rax,40(%rdi) > + movnti %rax,48(%rdi) > + movnti %rax,56(%rdi) > + leaq 64(%rdi),%rdi > + jnz loop_64 > + > + Your version of memset_nocache differs from memset only in its use of the movnti instruction. You may consider using preprocessor macros so that a single copy of the source code generates two different versions of the executable code. That will make the new code much easier to maintain. For example, #include ... #define MOVQ movnti #define memset memset_nocache #define __memset __memset_nocache #include "memset_64.S" Of course, you need to replace the target movq instructions in memset_64.S with MOVQ and define the default there: #ifndef MOVQ #define MOVQ movq #endif You also need to use a conditional compilation macro to disable the alternative-instruction code in memset_64.S.
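The single-source idea above can be sketched in plain C rather than assembly. This is only an illustration under stated assumptions: `DEFINE_MEMSET`, `STORE_CACHED`, `STORE_NOCACHE`, `memset_sketch` and `memset_nocache_sketch` are hypothetical names, and the volatile store merely stands in for movnti (it is not a non-temporal write).

```c
#include <stddef.h>

/*
 * Hypothetical store primitives. In the real patch these would be the
 * movq and movnti instructions; here a plain store and a volatile store
 * act as placeholders so the sketch stays portable.
 */
#define STORE_CACHED(dst, val)   (*(dst) = (val))
#define STORE_NOCACHE(dst, val)  (*(volatile unsigned char *)(dst) = (val))

/*
 * One copy of the function body; the STORE parameter selects which
 * store primitive the generated function uses.
 */
#define DEFINE_MEMSET(name, STORE)				\
static void *name(void *s, int c, size_t n)			\
{								\
	unsigned char *p = s;					\
								\
	while (n--) {						\
		STORE(p, (unsigned char)c);			\
		p++;						\
	}							\
	return s;						\
}

/* Two functions generated from the same source text. */
DEFINE_MEMSET(memset_sketch, STORE_CACHED)
DEFINE_MEMSET(memset_nocache_sketch, STORE_NOCACHE)
```

Both generated functions fill the buffer identically and return the original pointer; only the store primitive differs, so a fix to the loop applies to both variants at once.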
Cheers, Longman From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965655AbbD0Wnb (ORCPT ); Mon, 27 Apr 2015 18:43:31 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:40069 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965092AbbD0Wn2 (ORCPT ); Mon, 27 Apr 2015 18:43:28 -0400 Date: Mon, 27 Apr 2015 15:43:27 -0700 From: Andrew Morton To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML Subject: Re: [PATCH 03/13] mm: meminit: Only set page reserved in the memblock region Message-Id: <20150427154327.f7326dc16649ae402b5b5dd3@linux-foundation.org> In-Reply-To: <1429785196-7668-4-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-4-git-send-email-mgorman@suse.de> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 23 Apr 2015 11:33:06 +0100 Mel Gorman wrote: > From: Nathan Zimmer > > Currently we when we initialze each page struct is set as reserved upon > initialization. Hard to parse. I changed it to "Currently each page struct is set as reserved upon initialization". > This changes to starting with the reserved bit clear and > then only setting the bit in the reserved region. For what reason? A code comment over reserve_bootmem_region() would be a good way to answer that. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965704AbbD0Wni (ORCPT ); Mon, 27 Apr 2015 18:43:38 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:40078 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965092AbbD0Wne (ORCPT ); Mon, 27 Apr 2015 18:43:34 -0400 Date: Mon, 27 Apr 2015 15:43:33 -0700 From: Andrew Morton To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML Subject: Re: [PATCH 05/13] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid Message-Id: <20150427154333.85a1fd2dbc38c7c0888fd4f5@linux-foundation.org> In-Reply-To: <1429785196-7668-6-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-6-git-send-email-mgorman@suse.de> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 23 Apr 2015 11:33:08 +0100 Mel Gorman wrote: > __early_pfn_to_nid() in the generic and arch-specific implementations > use static variables to cache recent lookups. Without the cache > boot times are much higher due to the excessive memblock lookups but > it assumes that memory initialisation is single-threaded. Parallel > initialisation of struct pages will break that assumption so this patch > makes __early_pfn_to_nid() SMP-safe by requiring the caller to cache > recent search information. early_pfn_to_nid() keeps the same interface > but is only safe to use early in boot due to the use of a global static > variable. meminit_pfn_in_nid() is an SMP-safe version that callers must > maintain their own state for. Seems a bit awkward. 
> +struct __meminitdata mminit_pfnnid_cache global_init_state; > + > +/* Only safe to use early in boot when initialisation is single-threaded */ > int __meminit early_pfn_to_nid(unsigned long pfn) > { > int nid; > > - nid = __early_pfn_to_nid(pfn); > + /* The system will behave unpredictably otherwise */ > + BUG_ON(system_state != SYSTEM_BOOTING); Because of this. Providing a cache per cpu: struct __meminitdata mminit_pfnnid_cache global_init_state[NR_CPUS]; would be simpler? Also, `global_init_state' is a poor name for a kernel-wide symbol. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965717AbbD0Wnu (ORCPT ); Mon, 27 Apr 2015 18:43:50 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:40082 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965487AbbD0Wnp (ORCPT ); Mon, 27 Apr 2015 18:43:45 -0400 Date: Mon, 27 Apr 2015 15:43:44 -0700 From: Andrew Morton To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set Message-Id: <20150427154344.421fd9f151bf27d365d02fd2@linux-foundation.org> In-Reply-To: <1429785196-7668-8-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-8-git-send-email-mgorman@suse.de> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 23 Apr 2015 11:33:10 +0100 Mel Gorman wrote: > This patch initalises all low memory struct pages and 2G of the highest zone > on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT > is set. 
That config option cannot be set but will be available in a later > patch. Parallel initialisation of struct page depends on some features > from memory hotplug and it is necessary to alter alter section annotations. > > ... > > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT > +#define __defermem_init __meminit > +#define __defer_init __meminit > +#else > +#define __defermem_init > +#define __defer_init __init > +#endif Could we get some comments describing these? What they do, when and where they should be used. I have a suspicion that the naming isn't good, but I didn't spend a lot of time reverse-engineering the intent... From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965749AbbD0Wnz (ORCPT ); Mon, 27 Apr 2015 18:43:55 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:40088 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965722AbbD0Wnw (ORCPT ); Mon, 27 Apr 2015 18:43:52 -0400 Date: Mon, 27 Apr 2015 15:43:50 -0700 From: Andrew Morton To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML Subject: Re: [PATCH 08/13] mm: meminit: Initialise remaining struct pages in parallel with kswapd Message-Id: <20150427154350.4d649694a56e5bbc519e1fb4@linux-foundation.org> In-Reply-To: <1429785196-7668-9-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-9-git-send-email-mgorman@suse.de> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 23 Apr 2015 11:33:11 +0100 Mel Gorman wrote: > Only a subset of struct pages are initialised at the moment. When this patch > is applied kswapd initialise the remaining struct pages in parallel. 
This > should boot faster by spreading the work to multiple CPUs and initialising > data that is local to the CPU. The user-visible effect on large machines > is that free memory will appear to rapidly increase early in the lifetime > of the system until kswapd reports that all memory is initialised in the > kernel log. Once initialised there should be no other user-visibile effects. > > ... > > + pr_info("kswapd %d initialised deferred memory in %ums\n", nid, > + jiffies_to_msecs(jiffies - start)); It might be nice to tell people how much deferred memory kswapd initialised. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965775AbbD0WoA (ORCPT ); Mon, 27 Apr 2015 18:44:00 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:40096 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965722AbbD0Wn5 (ORCPT ); Mon, 27 Apr 2015 18:43:57 -0400 Date: Mon, 27 Apr 2015 15:43:56 -0700 From: Andrew Morton To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML Subject: Re: [PATCH 11/13] mm: meminit: Free pages in large chunks where possible Message-Id: <20150427154356.67e3d186b732a2c2b00e49cb@linux-foundation.org> In-Reply-To: <1429785196-7668-12-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-12-git-send-email-mgorman@suse.de> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 23 Apr 2015 11:33:14 +0100 Mel Gorman wrote: > Parallel struct page frees pages one at a time. Try free pages as single > large pages where possible. > > ... 
> > void __defermem_init deferred_init_memmap(int nid) This function is gruesome in an 80-col display. Even the code comments wrap, which is nuts. Maybe hoist the contents of the outermost loop into a separate function, called for each zone? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965587AbbD0Wqg (ORCPT ); Mon, 27 Apr 2015 18:46:36 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:40147 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965161AbbD0Wqe (ORCPT ); Mon, 27 Apr 2015 18:46:34 -0400 Date: Mon, 27 Apr 2015 15:46:33 -0700 From: Andrew Morton To: Mel Gorman Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function. Message-Id: <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> In-Reply-To: <1429785196-7668-3-git-send-email-mgorman@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-3-git-send-email-mgorman@suse.de> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman wrote: > From: Robin Holt : : host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1 : : Recipient address rejected: User unknown in virtual alias : table (in reply to RCPT TO command) Has Robin moved, or is SGI mail busted? 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933277AbbD1I2p (ORCPT ); Tue, 28 Apr 2015 04:28:45 -0400 Received: from cantor2.suse.de ([195.135.220.15]:51514 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932581AbbD1I2k (ORCPT ); Tue, 28 Apr 2015 04:28:40 -0400 Date: Tue, 28 Apr 2015 09:28:31 +0100 From: Mel Gorman To: Andrew Morton Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function. Message-ID: <20150428082831.GI2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-3-git-send-email-mgorman@suse.de> <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote: > On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman wrote: > > > From: Robin Holt > > : : host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1 > : : Recipient address rejected: User unknown in virtual alias > : table (in reply to RCPT TO command) > > Has Robin moved, or is SGI mail busted? Robin has moved and I do not have an updated address for him. The address used in the patches was the one he posted the patches with. 
-- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933191AbbD1Jh6 (ORCPT ); Tue, 28 Apr 2015 05:37:58 -0400 Received: from cantor2.suse.de ([195.135.220.15]:55621 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932510AbbD1Jh4 (ORCPT ); Tue, 28 Apr 2015 05:37:56 -0400 Date: Tue, 28 Apr 2015 10:37:51 +0100 From: Mel Gorman To: Andrew Morton Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML Subject: Re: [PATCH 05/13] mm: meminit: Make __early_pfn_to_nid SMP-safe and introduce meminit_pfn_in_nid Message-ID: <20150428093751.GJ2449@suse.de> References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-6-git-send-email-mgorman@suse.de> <20150427154333.85a1fd2dbc38c7c0888fd4f5@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20150427154333.85a1fd2dbc38c7c0888fd4f5@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Apr 27, 2015 at 03:43:33PM -0700, Andrew Morton wrote: > On Thu, 23 Apr 2015 11:33:08 +0100 Mel Gorman wrote: > > > __early_pfn_to_nid() in the generic and arch-specific implementations > > use static variables to cache recent lookups. Without the cache > > boot times are much higher due to the excessive memblock lookups but > > it assumes that memory initialisation is single-threaded. Parallel > > initialisation of struct pages will break that assumption so this patch > > makes __early_pfn_to_nid() SMP-safe by requiring the caller to cache > > recent search information. early_pfn_to_nid() keeps the same interface > > but is only safe to use early in boot due to the use of a global static > > variable. 
meminit_pfn_in_nid() is an SMP-safe version that callers must > > maintain their own state for. > > Seems a bit awkward. > I'm afraid I don't understand which part you mean. > > +struct __meminitdata mminit_pfnnid_cache global_init_state; > > + > > +/* Only safe to use early in boot when initialisation is single-threaded */ > > int __meminit early_pfn_to_nid(unsigned long pfn) > > { > > int nid; > > > > - nid = __early_pfn_to_nid(pfn); > > + /* The system will behave unpredictably otherwise */ > > + BUG_ON(system_state != SYSTEM_BOOTING); > > Because of this. > > Providing a cache per cpu: > > struct __meminitdata mminit_pfnnid_cache global_init_state[NR_CPUS]; > > would be simpler? > It would be simpler in terms of implementation but it's wasteful. We only need a small number of these caches early in boot. NR_CPUS is potentially very large. > > Also, `global_init_state' is a poor name for a kernel-wide symbol. You're right. It's not really global; it's just the one that is used if the caller does not track its own state. It should have been static, and I have renamed it to early_pfnnid_cache.
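A minimal userspace sketch of the caller-owned cache discussed above, assuming hypothetical names throughout (`struct pfn_range`, `struct pfnnid_cache`, `pfn_to_nid_cached()`, `early_pfn_to_nid_sketch()`) and an invented two-node memory map; it is not the kernel implementation. Each caller passes its own small cache, so concurrent callers never share mutable state, while one static cache serves the single-threaded early path.

```c
#include <stddef.h>

/* Hypothetical stand-in for a memory region: PFNs [start, end) on node nid. */
struct pfn_range {
	unsigned long start;
	unsigned long end;
	int nid;
};

/* Caller-owned record of the last range that matched a lookup. */
struct pfnnid_cache {
	unsigned long last_start;
	unsigned long last_end;
	int last_nid;
};

/* Invented memory map used only for this illustration. */
static const struct pfn_range ranges[] = {
	{ 0x00000, 0x40000, 0 },
	{ 0x40000, 0x80000, 1 },
};

/*
 * Return the node for pfn, or -1 if no range covers it. The cache belongs
 * to the caller, so two threads with separate caches never race on it.
 */
static int pfn_to_nid_cached(unsigned long pfn, struct pfnnid_cache *cache)
{
	size_t i;

	/* Fast path: same range as the previous successful lookup. */
	if (cache->last_start <= pfn && pfn < cache->last_end)
		return cache->last_nid;

	/* Slow path: linear search, then remember the hit. */
	for (i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
		if (ranges[i].start <= pfn && pfn < ranges[i].end) {
			cache->last_start = ranges[i].start;
			cache->last_end = ranges[i].end;
			cache->last_nid = ranges[i].nid;
			return cache->last_nid;
		}
	}
	return -1;
}

/* Default cache, safe only while lookups are single-threaded. */
static struct pfnnid_cache early_cache;

static int early_pfn_to_nid_sketch(unsigned long pfn)
{
	return pfn_to_nid_cached(pfn, &early_cache);
}
```

With this shape, a per-node initialisation thread would keep one `struct pfnnid_cache` on its own stack, which costs a few words per thread rather than an array sized by NR_CPUS.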
-- 
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933117AbbD1Jxa (ORCPT ); Tue, 28 Apr 2015 05:53:30 -0400
Received: from cantor2.suse.de ([195.135.220.15]:56322 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752866AbbD1Jx2 (ORCPT ); Tue, 28 Apr 2015 05:53:28 -0400
Date: Tue, 28 Apr 2015 10:53:23 +0100
From: Mel Gorman
To: Andrew Morton
Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML
Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set
Message-ID: <20150428095323.GK2449@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-8-git-send-email-mgorman@suse.de> <20150427154344.421fd9f151bf27d365d02fd2@linux-foundation.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20150427154344.421fd9f151bf27d365d02fd2@linux-foundation.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 27, 2015 at 03:43:44PM -0700, Andrew Morton wrote:
> On Thu, 23 Apr 2015 11:33:10 +0100 Mel Gorman wrote:
>
> > This patch initialises all low memory struct pages and 2G of the highest zone
> > on each node during memory initialisation if CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > is set. That config option cannot be set but will be available in a later
> > patch. Parallel initialisation of struct page depends on some features
> > from memory hotplug and it is necessary to alter section annotations.
> >
> > ...
> >
> > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > +#define __defermem_init __meminit
> > +#define __defer_init __meminit
> > +#else
> > +#define __defermem_init
> > +#define __defer_init __init
> > +#endif
>
> Could we get some comments describing these? What they do, when and
> where they should be used. I have a suspicion that the naming isn't
> good, but I didn't spend a lot of time reverse-engineering the
> intent...
>

Of course. The next version will have

+/*
+ * Deferred struct page initialisation requires some early init functions that
+ * are removed before kswapd is up and running. The feature depends on memory
+ * hotplug so put the data and code required by deferred initialisation into
+ * the __meminit section where they are preserved.
+ */

-- 
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965576AbbD1LiZ (ORCPT ); Tue, 28 Apr 2015 07:38:25 -0400
Received: from cantor2.suse.de ([195.135.220.15]:37050 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965298AbbD1LiY (ORCPT ); Tue, 28 Apr 2015 07:38:24 -0400
Date: Tue, 28 Apr 2015 12:38:20 +0100
From: Mel Gorman
To: Andrew Morton
Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML
Subject: Re: [PATCH 11/13] mm: meminit: Free pages in large chunks where possible
Message-ID: <20150428113819.GL2449@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-12-git-send-email-mgorman@suse.de> <20150427154356.67e3d186b732a2c2b00e49cb@linux-foundation.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20150427154356.67e3d186b732a2c2b00e49cb@linux-foundation.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 27, 2015 at 03:43:56PM -0700, Andrew Morton wrote:
> On Thu, 23 Apr 2015 11:33:14 +0100 Mel Gorman wrote:
>
> > Parallel struct page frees pages one at a time. Try to free pages as single
> > large pages where possible.
> >
> > ...
> >
> > void __defermem_init deferred_init_memmap(int nid)
>
> This function is gruesome in an 80-col display. Even the code comments
> wrap, which is nuts. Maybe hoist the contents of the outermost loop
> into a separate function, called for each zone?

I can do better than that because only the highest zone is deferred
in this version and the loop is no longer necessary. I should post a V4
before the end of my day that addresses your feedback. It caused a lot of
conflicts and it'll be easier to replace the full series than try managing
incremental fixes.

Thanks Andrew.

-- 
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031104AbbD1Nll (ORCPT ); Tue, 28 Apr 2015 09:41:41 -0400
Received: from mail.linuxfoundation.org ([140.211.169.12]:50859 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965640AbbD1Nlg (ORCPT ); Tue, 28 Apr 2015 09:41:36 -0400
Date: Tue, 28 Apr 2015 06:48:10 -0700
From: Andrew Morton
To: Mel Gorman
Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML
Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set
Message-Id: <20150428064810.0882ad36.akpm@linux-foundation.org>
In-Reply-To: <20150428095323.GK2449@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-8-git-send-email-mgorman@suse.de> <20150427154344.421fd9f151bf27d365d02fd2@linux-foundation.org> <20150428095323.GK2449@suse.de>
X-Mailer: Sylpheed 2.7.1 (GTK+ 2.18.9; x86_64-redhat-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 28 Apr 2015 10:53:23 +0100 Mel Gorman wrote:

> > > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > > +#define __defermem_init __meminit
> > > +#define __defer_init __meminit
> > > +#else
> > > +#define __defermem_init
> > > +#define __defer_init __init
> > > +#endif
> >
> > Could we get some comments describing these? What they do, when and
> > where they should be used. I have a suspicion that the naming isn't
> > good, but I didn't spend a lot of time reverse-engineering the
> > intent...
> >
>
> Of course. The next version will have
>
> +/*
> + * Deferred struct page initialisation requires some early init functions that
> + * are removed before kswapd is up and running. The feature depends on memory
> + * hotplug so put the data and code required by deferred initialisation into
> + * the __meminit section where they are preserved.
> + */

I'm still not getting it even a little bit :( You say "data and code",
so I'd expect to see

#define __defer_meminitdata __meminitdata
#define __defer_meminit __meminit

But the patch doesn't mention the data segment at all.

The patch uses both __defermem_init and __defer_init to tag functions
(ie: text) and I can't work out why.
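
[Editorial sketch, not part of the thread.] To make the question about the
two annotations concrete, the quoted #ifdef block can be compiled in
userspace and its expansions printed as strings. This is only a minimal
sketch under stated assumptions: the config option is forced on for the
demo, __meminit and __init are deliberately left undefined so they survive
as plain tokens, and the STR/XSTR helpers and the three functions are
hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <string.h>

/* Assumption for this demo only: pretend the Kconfig option is enabled. */
#define CONFIG_DEFERRED_STRUCT_PAGE_INIT 1

/*
 * The definitions quoted in the thread. __meminit and __init are left
 * undefined here, so they remain plain tokens that can be stringified.
 */
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
#define __defermem_init __meminit
#define __defer_init    __meminit
#else
#define __defermem_init
#define __defer_init    __init
#endif

/* Hypothetical helpers: stringify a macro's expansion (which may be empty). */
#define STR(...)  #__VA_ARGS__
#define XSTR(...) STR(__VA_ARGS__)

/* What each annotation expands to in this configuration. */
const char *defermem_init_expansion(void) { return XSTR(__defermem_init); }
const char *defer_init_expansion(void)    { return XSTR(__defer_init); }

/* With the option set, both tags resolve to the same annotation. */
void check_same_annotation(void)
{
    assert(strcmp(defermem_init_expansion(), defer_init_expansion()) == 0);
}
```

With the option enabled both names stringify to "__meminit". Removing the
CONFIG_DEFERRED_STRUCT_PAGE_INIT define selects the #else branch, where
__defermem_init expands to nothing and __defer_init to "__init", which is
the only configuration in which the two names differ.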

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966192AbbD1O4i (ORCPT ); Tue, 28 Apr 2015 10:56:38 -0400
Received: from cantor2.suse.de ([195.135.220.15]:52617 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965724AbbD1O4g (ORCPT ); Tue, 28 Apr 2015 10:56:36 -0400
Date: Tue, 28 Apr 2015 15:56:32 +0100
From: Mel Gorman
To: Andrew Morton
Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML
Subject: Re: [PATCH 07/13] mm: meminit: Initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set
Message-ID: <20150428145632.GN2449@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-8-git-send-email-mgorman@suse.de> <20150427154344.421fd9f151bf27d365d02fd2@linux-foundation.org> <20150428095323.GK2449@suse.de> <20150428064810.0882ad36.akpm@linux-foundation.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20150428064810.0882ad36.akpm@linux-foundation.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Apr 28, 2015 at 06:48:10AM -0700, Andrew Morton wrote:
> On Tue, 28 Apr 2015 10:53:23 +0100 Mel Gorman wrote:
>
> > > > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > > > +#define __defermem_init __meminit
> > > > +#define __defer_init __meminit
> > > > +#else
> > > > +#define __defermem_init
> > > > +#define __defer_init __init
> > > > +#endif
> > >
> > > Could we get some comments describing these? What they do, when and
> > > where they should be used. I have a suspicion that the naming isn't
> > > good, but I didn't spend a lot of time reverse-engineering the
> > > intent...
> > >
> >
> > Of course. The next version will have
> >
> > +/*
> > + * Deferred struct page initialisation requires some early init functions that
> > + * are removed before kswapd is up and running. The feature depends on memory
> > + * hotplug so put the data and code required by deferred initialisation into
> > + * the __meminit section where they are preserved.
> > + */
>
> I'm still not getting it even a little bit :( You say "data and code",
> so I'd expect to see
>
> #define __defer_meminitdata __meminitdata
> #define __defer_meminit __meminit
>
> But the patch doesn't mention the data segment at all.
>

Take 2. Suggestions on different names are welcome because they are poor.

/*
 * Deferred struct page initialisation requires init functions that are freed
 * before kswapd is available. Reuse the memory hotplug section annotation
 * to mark the required code.
 *
 * __defermem_init is code that always exists but is annotated __meminit to
 * avoid section warnings.
 * __defer_init code gets marked __meminit when deferring struct page
 * initialisation but is otherwise in the init section.
 */

-- 
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030489AbbD1QCz (ORCPT ); Tue, 28 Apr 2015 12:02:55 -0400
Received: from relay2.sgi.com ([192.48.180.65]:47163 "EHLO relay.sgi.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1030248AbbD1QCw (ORCPT ); Tue, 28 Apr 2015 12:02:52 -0400
Message-ID: <553FAF26.9060609@sgi.com>
Date: Tue, 28 Apr 2015 11:02:46 -0500
From: nzimmer
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: Mel Gorman , Andrew Morton
CC: Linux-MM , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML
Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.
References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-3-git-send-email-mgorman@suse.de> <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> <20150428082831.GI2449@suse.de>
In-Reply-To: <20150428082831.GI2449@suse.de>
Content-Type: text/plain; charset="iso-8859-15"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [128.162.233.123]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

This is the one I have, but I haven't had a chance to talk with him in a
long time.

robinmholt@gmail.com

On 04/28/2015 03:28 AM, Mel Gorman wrote:
> On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote:
>> On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman wrote:
>>
>>> From: Robin Holt
>> : : host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1
>> : : Recipient address rejected: User unknown in virtual alias
>> : table (in reply to RCPT TO command)
>>
>> Has Robin moved, or is SGI mail busted?
> Robin has moved and I do not have an updated address for him. The
> address used in the patches was the one he posted the patches with.
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031216AbbD1WlE (ORCPT ); Tue, 28 Apr 2015 18:41:04 -0400
Received: from mail.linuxfoundation.org ([140.211.169.12]:35238 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030901AbbD1WlC (ORCPT ); Tue, 28 Apr 2015 18:41:02 -0400
Date: Tue, 28 Apr 2015 15:41:00 -0700
From: Andrew Morton
To: Mel Gorman
Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , LKML
Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.
Message-Id: <20150428154100.0f6bd333620b2e744ee66221@linux-foundation.org>
In-Reply-To: <20150428082831.GI2449@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-3-git-send-email-mgorman@suse.de> <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> <20150428082831.GI2449@suse.de>
X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 28 Apr 2015 09:28:31 +0100 Mel Gorman wrote:

> On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote:
> > On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman wrote:
> >
> > > From: Robin Holt
> >
> > : : host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1
> > : : Recipient address rejected: User unknown in virtual alias
> > : table (in reply to RCPT TO command)
> >
> > Has Robin moved, or is SGI mail busted?
>
> Robin has moved and I do not have an updated address for him. The
> address used in the patches was the one he posted the patches with.
>

As Nathan mentioned,

z:/usr/src/git26> git log | grep "Robin Holt"
Cc: Robin Holt
Acked-by: Robin Holt
Cc: Robin Holt
Cc: Robin Holt
Cc: Robin Holt

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031227AbbD1XFN (ORCPT ); Tue, 28 Apr 2015 19:05:13 -0400
Received: from cantor2.suse.de ([195.135.220.15]:49150 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031114AbbD1XFL (ORCPT ); Tue, 28 Apr 2015 19:05:11 -0400
Date: Wed, 29 Apr 2015 00:05:06 +0100
From: Mel Gorman
To: Andrew Morton
Cc: Linux-MM , Nathan Zimmer , Dave Hansen , Waiman Long , Scott Norton , Daniel J Blueman , Robin Holt , LKML
Subject: Re: [PATCH 02/13] mm: meminit: Move page initialization into a separate function.
Message-ID: <20150428230506.GP2449@suse.de>
References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429785196-7668-3-git-send-email-mgorman@suse.de> <20150427154633.2134d804987dad88e008c2ff@linux-foundation.org> <20150428082831.GI2449@suse.de> <20150428154100.0f6bd333620b2e744ee66221@linux-foundation.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20150428154100.0f6bd333620b2e744ee66221@linux-foundation.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Apr 28, 2015 at 03:41:00PM -0700, Andrew Morton wrote:
> On Tue, 28 Apr 2015 09:28:31 +0100 Mel Gorman wrote:
>
> > On Mon, Apr 27, 2015 at 03:46:33PM -0700, Andrew Morton wrote:
> > > On Thu, 23 Apr 2015 11:33:05 +0100 Mel Gorman wrote:
> > >
> > > > From: Robin Holt
> > >
> > > : : host cuda-allmx.sgi.com[192.48.157.12] said: 550 cuda_nsu 5.1.1
> > > : : Recipient address rejected: User unknown in virtual alias
> > > : table (in reply to RCPT TO command)
> > >
> > > Has Robin moved, or is SGI mail busted?
> >
> > Robin has moved and I do not have an updated address for him. The
> > address used in the patches was the one he posted the patches with.
> >
>
> As Nathan mentioned,
>
> z:/usr/src/git26> git log | grep "Robin Holt"
> Cc: Robin Holt
> Acked-by: Robin Holt
> Cc: Robin Holt
> Cc: Robin Holt
> Cc: Robin Holt

I can update the address if Robin wishes (cc'd). I was preserving the
address that was used to actually sign off the patches as that was the
history.

-- 
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031346AbbD2Bbz (ORCPT ); Tue, 28 Apr 2015 21:31:55 -0400
Received: from g2t2354.austin.hp.com ([15.217.128.53]:48727 "EHLO g2t2354.austin.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1031276AbbD2Bbx (ORCPT ); Tue, 28 Apr 2015 21:31:53 -0400
Message-ID: <55403484.8060906@hp.com>
Date: Tue, 28 Apr 2015 21:31:48 -0400
From: Waiman Long
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20130109 Thunderbird/10.0.12
MIME-Version: 1.0
To: Daniel J Blueman
CC: Mel Gorman , Linux-MM , Nathan Zimmer , Dave Hansen , Scott Norton , Andrew Morton , LKML , "'Steffen Persvold'"
Subject: Re: [PATCH 0/13] Parallel struct page initialisation v3
References: <1429785196-7668-1-git-send-email-mgorman@suse.de> <1429804437.24139.3@cpanel21.proisp.no>
In-Reply-To: <1429804437.24139.3@cpanel21.proisp.no>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/23/2015 11:53 AM, Daniel J Blueman wrote:
> On Thu, Apr 23, 2015 at 6:33 PM, Mel Gorman wrote:
>> The big change here is an adjustment to the topology_init path that caused
>> soft lockups on Waiman and Daniel Blue had reported it was an expensive
>> function.
>>
>> Changelog since v2
>> o Reduce overhead of topology_init
>> o Remove boot-time kernel parameter to enable/disable
>> o Enable on UMA
>>
>> Changelog since v1
>> o Always initialise low zones
>> o Typo corrections
>> o Rename parallel mem init to parallel struct page init
>> o Rebase to 4.0
> []
>
> Splendid work! On this 256c setup, topology_init now takes 185ms.
>
> This brings the kernel boot time down to 324s [1]. It turns out that
> one memset is responsible for most of the time setting up the PUDs
> and PMDs; adapting memset to use non-temporal writes [3] avoids
> generating RMW cycles, bringing boot time down to 186s [2].
>
> If this is a possibility, I can split this patch and map other arch's
> memset_nocache to memset, or change the callsite as preferred;
> comments welcome.
>
> Thanks,
> Daniel
>
> [1] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2.txt
> [2] https://resources.numascale.com/telemetry/defermem/h8qgl-defer2-nontemporal.txt
>
> -- [3]
>
> From f822139736cab8434302693c635fa146b465273c Mon Sep 17 00:00:00 2001
> From: Daniel J Blueman
> Date: Thu, 23 Apr 2015 23:26:27 +0800
> Subject: [RFC] Speedup PMD setup
>
> Using non-temporal writes prevents read-modify-write cycles,
> which are much slower over large topologies.
>
> Adapt the existing memset() function into a _nocache variant and use
> when setting up PMDs during early boot to reduce boot time.
>
> Signed-off-by: Daniel J Blueman
> ---
>  arch/x86/include/asm/string_64.h |  3 ++
>  arch/x86/lib/memset_64.S         | 90 ++++++++++++++++++++++++++++++++++++++++
>  mm/memblock.c                    |  2 +-
>  3 files changed, 94 insertions(+), 1 deletion(-)
>

I tried your patch on my 12-TB IvyBridge-EX test machine and the bootup
time increased from 265s to 289s (24s increase). I think my IvyBridge-EX
box was using the optimized memset_c_e (rep stosb) code which turned out
to perform better than the non-temporal move in your code. I think that
may be due to the temporal moves that need to be done at the beginning
and end of the memory range.

I had tried to replace clear_page() with non-temporal moves. I generally
got an improvement of a few percentage points compared with the optimized
clear_page_c() and clear_page_c_e() code. That is not a lot.

Anyway, I think the AMD box that you used wasn't setting the
X86_FEATURE_REP_GOOD or X86_FEATURE_ERMS bits, resulting in poor memset
performance. If such a feature is supported in the AMD CPU (albeit in a
different way), you may consider sending in a patch to set those feature
bits. Alternatively, you will need to duplicate the alternative
instruction stuff in your memset_nocache() to make sure that it can use
the optimized code, if appropriate.

Cheers,
Longman
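
[Editorial sketch, not part of the thread.] The remark about ordinary
(temporal) moves being needed at the beginning and end of the range can be
pictured as splitting a fill into three pieces. The code below is a
minimal, portable illustration of that split only. It is not Daniel's
memset_nocache() and issues no non-temporal instructions; struct fill_plan,
plan_fill() and the 64-byte unit used in the usage note are assumptions
made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of how a fill of [addr, addr + len) is split. */
struct fill_plan {
    size_t head; /* unaligned leading bytes: ordinary stores */
    size_t body; /* aligned middle: candidate for non-temporal stores */
    size_t tail; /* trailing bytes shorter than one unit: ordinary stores */
};

/*
 * align is the unit the streaming loop is assumed to work in and must be
 * a power of two.
 */
struct fill_plan plan_fill(uintptr_t addr, size_t len, size_t align)
{
    struct fill_plan p = {0, 0, 0};
    size_t misalign;

    assert(align != 0 && (align & (align - 1)) == 0);

    /* Bytes needed to reach the next aligned address, if any. */
    misalign = (size_t)(addr & (align - 1));
    if (misalign != 0)
        p.head = align - misalign;
    if (p.head > len)
        p.head = len; /* the range ends before the first boundary */

    /* Whole aligned units in the middle, remainder at the end. */
    p.body = (len - p.head) & ~(align - 1);
    p.tail = len - p.head - p.body;
    return p;
}
```

For example, plan_fill(0x1008, 200, 64) gives a 56-byte head, a 128-byte
body and a 16-byte tail, so only the middle piece would be eligible for
streaming stores. For short or badly aligned ranges the head and tail make
up most of the work, which is one possible reading of the effect described
above.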