[PATCH v2 0/1] initialize pages on demand during boot

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/1] initialize pages on demand during boot
@ 2018-02-08 18:45 Pavel Tatashin
  2018-02-08 18:45 ` [PATCH v2 1/1] mm: " Pavel Tatashin
  0 siblings, 1 reply; 7+ messages in thread
From: Pavel Tatashin @ 2018-02-08 18:45 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, pasha.tatashin, m.mizuma, akpm,
	mhocko, catalin.marinas, takahiro.akashi, gi-oh.kim,
	heiko.carstens, baiyaowei, richard.weiyang, paul.burton,
	miles.chen, vbabka, mgorman, hannes, linux-kernel, linux-mm

Change log:
	v1 - v2: Added Tested-by: Masayoshi Mizuma

Answering Andrew Morton's questions:

> Presumably this fixes some real-world problem which someone has observed?

Yes, linked below.

> Please describe that problem for us in lavish detail.

This change helps for three reasons:

1. Insufficient amount of reserved memory due to arguments provided by
user. User may request some buffers, increased hash tables sizes etc.
Currently, machine panics during boot if it can't allocate memory due
to insufficient amount of reserved memory. With this change, it will
be able to grow zone before deferred pages are initialized.

One observed example is described in the linked discussion [1] Mel
Gorman writes:

"
Yasuaki Ishimatsu reported a premature OOM when trace_buf_size=100m was
specified on a machine with many CPUs. The kernel tried to allocate 38.4GB
but only 16GB was available due to deferred memory initialisation.
"

The allocations in the above scenario happen per-cpu in smp_init(),
and before deferred pages are initialized. So, there is no way to
predict how much memory we should put aside to boot successfully with
deferred page initialization feature compiled in.

2. The second reason is future proof. The kernel memory requirements
may change, and we do not want to constantly update
reset_deferred_meminit() to satisfy the new requirements. In addition,
this function is currently in common code, but potentially would need
to be split into arch specific variants, as more arches will start
taking advantage of deferred page initialization feature.

3. On demand initialization of reserved pages guarantees that we will
initialize only as many pages early in boot using only one thread as
needed, the rest are going to be efficiently initialized in parallel.

[1] https://www.spinics.net/lists/linux-mm/msg139087.html

Pavel Tatashin (1):
  mm: initialize pages on demand during boot

 include/linux/memblock.h |  10 ---
 mm/memblock.c            |  23 -------
 mm/page_alloc.c          | 164 ++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 125 insertions(+), 72 deletions(-)

-- 
2.16.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2 1/1] mm: initialize pages on demand during boot
  2018-02-08 18:45 [PATCH v2 0/1] initialize pages on demand during boot Pavel Tatashin
@ 2018-02-08 18:45 ` Pavel Tatashin
  2018-02-08 20:03   ` Andrew Morton
                     ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Pavel Tatashin @ 2018-02-08 18:45 UTC (permalink / raw)
  To: steven.sistare, daniel.m.jordan, pasha.tatashin, m.mizuma, akpm,
	mhocko, catalin.marinas, takahiro.akashi, gi-oh.kim,
	heiko.carstens, baiyaowei, richard.weiyang, paul.burton,
	miles.chen, vbabka, mgorman, hannes, linux-kernel, linux-mm

Deferred page initialization allows the boot cpu to initialize a small
subset of the system's pages early in boot, with other cpus doing the rest
later on.

It is, however, problematic to know how many pages the kernel needs during
boot.  Different modules and kernel parameters may change the requirement,
so the boot cpu either initializes too many pages or runs out of memory.

To fix that, initialize early pages on demand.  This ensures the kernel
does the minimum amount of work to initialize pages during boot and leaves
the rest to be divided in the multithreaded initialization path
(deferred_init_memmap).

The on-demand code is permanently disabled using static branching once
deferred pages are initialized.  After the static branch is changed to
false, the overhead is up-to two branch-always instructions if the zone
watermark check fails or if rmqueue fails.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
---
 include/linux/memblock.h |  10 ---
 mm/memblock.c            |  23 -------
 mm/page_alloc.c          | 164 ++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 125 insertions(+), 72 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 8be5077efb5f..6c305afd95ab 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -417,21 +417,11 @@ static inline void early_memtest(phys_addr_t start, phys_addr_t end)
 {
 }
 #endif
-
-extern unsigned long memblock_reserved_memory_within(phys_addr_t start_addr,
-		phys_addr_t end_addr);
 #else
 static inline phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align)
 {
 	return 0;
 }
-
-static inline unsigned long memblock_reserved_memory_within(phys_addr_t start_addr,
-		phys_addr_t end_addr)
-{
-	return 0;
-}
-
 #endif /* CONFIG_HAVE_MEMBLOCK */
 
 #endif /* __KERNEL__ */
diff --git a/mm/memblock.c b/mm/memblock.c
index 5a9ca2a1751b..4120e9f536f7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1778,29 +1778,6 @@ static void __init_memblock memblock_dump(struct memblock_type *type)
 	}
 }
 
-extern unsigned long __init_memblock
-memblock_reserved_memory_within(phys_addr_t start_addr, phys_addr_t end_addr)
-{
-	struct memblock_region *rgn;
-	unsigned long size = 0;
-	int idx;
-
-	for_each_memblock_type(idx, (&memblock.reserved), rgn) {
-		phys_addr_t start, end;
-
-		if (rgn->base + rgn->size < start_addr)
-			continue;
-		if (rgn->base > end_addr)
-			continue;
-
-		start = rgn->base;
-		end = start + rgn->size;
-		size += end - start;
-	}
-
-	return size;
-}
-
 void __init_memblock __memblock_dump_all(void)
 {
 	pr_info("MEMBLOCK configuration:\n");
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81e18ceef579..13d56b28e68e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -291,40 +291,6 @@ EXPORT_SYMBOL(nr_online_nodes);
 int page_group_by_mobility_disabled __read_mostly;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-
-/*
- * Determine how many pages need to be initialized during early boot
- * (non-deferred initialization).
- * The value of first_deferred_pfn will be set later, once non-deferred pages
- * are initialized, but for now set it ULONG_MAX.
- */
-static inline void reset_deferred_meminit(pg_data_t *pgdat)
-{
-	phys_addr_t start_addr, end_addr;
-	unsigned long max_pgcnt;
-	unsigned long reserved;
-
-	/*
-	 * Initialise at least 2G of a node but also take into account that
-	 * two large system hashes that can take up 1GB for 0.25TB/node.
-	 */
-	max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
-			(pgdat->node_spanned_pages >> 8));
-
-	/*
-	 * Compensate the all the memblock reservations (e.g. crash kernel)
-	 * from the initial estimation to make sure we will initialize enough
-	 * memory to boot.
-	 */
-	start_addr = PFN_PHYS(pgdat->node_start_pfn);
-	end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
-	reserved = memblock_reserved_memory_within(start_addr, end_addr);
-	max_pgcnt += PHYS_PFN(reserved);
-
-	pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
-	pgdat->first_deferred_pfn = ULONG_MAX;
-}
-
 /* Returns true if the struct page for the pfn is uninitialised */
 static inline bool __meminit early_page_uninitialised(unsigned long pfn)
 {
@@ -357,10 +323,6 @@ static inline bool update_defer_init(pg_data_t *pgdat,
 	return true;
 }
 #else
-static inline void reset_deferred_meminit(pg_data_t *pgdat)
-{
-}
-
 static inline bool early_page_uninitialised(unsigned long pfn)
 {
 	return false;
@@ -1604,6 +1566,96 @@ static int __init deferred_init_memmap(void *data)
 	pgdat_init_report_one_done();
 	return 0;
 }
+
+/*
+ * Protects some early interrupt threads, and also for a short period of time
+ * from  smp_init() to page_alloc_init_late() when deferred pages are
+ * initialized.
+ */
+static __initdata DEFINE_SPINLOCK(deferred_zone_grow_lock);
+DEFINE_STATIC_KEY_TRUE(deferred_pages);
+
+/*
+ * If this zone has deferred pages, try to grow it by initializing enough
+ * deferred pages to satisfy the allocation specified by order, rounded up to
+ * the nearest PAGES_PER_SECTION boundary.  So we're adding memory in increments
+ * of SECTION_SIZE bytes by initializing struct pages in increments of
+ * PAGES_PER_SECTION * sizeof(struct page) bytes.
+ */
+static noinline bool __init
+deferred_grow_zone(struct zone *zone, unsigned int order)
+{
+	int zid = zone_idx(zone);
+	int nid = zone->node;
+	pg_data_t *pgdat = NODE_DATA(nid);
+	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
+	unsigned long nr_pages = 0;
+	unsigned long first_init_pfn, first_deferred_pfn, spfn, epfn, t;
+	phys_addr_t spa, epa;
+	u64 i;
+
+	/* Only the last zone may have deferred pages */
+	if (zone_end_pfn(zone) != pgdat_end_pfn(pgdat))
+		return false;
+
+	first_deferred_pfn = READ_ONCE(pgdat->first_deferred_pfn);
+	first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
+
+	if (first_init_pfn >= pgdat_end_pfn(pgdat))
+		return false;
+
+	spin_lock(&deferred_zone_grow_lock);
+	/*
+	 * Bail if we raced with another thread that disabled on demand
+	 * initialization.
+	 */
+	if (!static_branch_unlikely(&deferred_pages)) {
+		spin_unlock(&deferred_zone_grow_lock);
+		return false;
+	}
+
+	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
+		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
+
+		while (spfn < epfn && nr_pages < nr_pages_needed) {
+			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
+			first_deferred_pfn = min(t, epfn);
+			nr_pages += deferred_init_pages(nid, zid, spfn,
+							first_deferred_pfn);
+			spfn = first_deferred_pfn;
+		}
+
+		if (nr_pages >= nr_pages_needed)
+			break;
+	}
+
+	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
+		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
+		epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
+		deferred_free_pages(nid, zid, spfn, epfn);
+
+		if (first_deferred_pfn == epfn)
+			break;
+	}
+	WRITE_ONCE(pgdat->first_deferred_pfn, first_deferred_pfn);
+	spin_unlock(&deferred_zone_grow_lock);
+
+	return nr_pages >= nr_pages_needed;
+}
+
+/*
+ * deferred_grow_zone() is __init, but it is called from
+ * get_page_from_freelist() during early boot until deferred_pages permanently
+ * disables this call. This is why, we have refdata wrapper to avoid warning,
+ * and ensure that the function body gets unloaded.
+ */
+static bool __ref
+_deferred_grow_zone(struct zone *zone, unsigned int order)
+{
+	return deferred_grow_zone(zone, order);
+}
+
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 void __init page_alloc_init_late(void)
@@ -1613,6 +1665,14 @@ void __init page_alloc_init_late(void)
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	int nid;
 
+	/*
+	 * We are about to initialize the rest of deferred pages, permanently
+	 * disable on-demand struct page initialization.
+	 */
+	spin_lock(&deferred_zone_grow_lock);
+	static_branch_disable(&deferred_pages);
+	spin_unlock(&deferred_zone_grow_lock);
+
 	/* There will be num_node_state(N_MEMORY) threads */
 	atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
 	for_each_node_state(nid, N_MEMORY) {
@@ -3199,6 +3259,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				       ac_classzone_idx(ac), alloc_flags)) {
 			int ret;
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+			/*
+			 * Watermark failed for this zone, but see if we can
+			 * grow this zone if it contains deferred pages.
+			 */
+			if (static_branch_unlikely(&deferred_pages)) {
+				if (_deferred_grow_zone(zone, order))
+					goto try_this_zone;
+			}
+#endif
 			/* Checked here to keep the fast path fast */
 			BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 			if (alloc_flags & ALLOC_NO_WATERMARKS)
@@ -3240,6 +3310,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				reserve_highatomic_pageblock(page, zone, order);
 
 			return page;
+		} else {
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+			/* Try again if zone has deferred pages */
+			if (static_branch_unlikely(&deferred_pages)) {
+				if (_deferred_grow_zone(zone, order))
+					goto try_this_zone;
+			}
+#endif
 		}
 	}
 
@@ -6247,7 +6325,15 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 
 	alloc_node_mem_map(pgdat);
 
-	reset_deferred_meminit(pgdat);
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+	/*
+	 * We start only with one section of pages, more pages are added as
+	 * needed until the rest of deferred pages are initialized.
+	 */
+	pgdat->static_init_pgcnt = min(PAGES_PER_SECTION,
+				       pgdat->node_spanned_pages);
+	pgdat->first_deferred_pfn = ULONG_MAX;
+#endif
 	free_area_init_core(pgdat);
 }
 
-- 
2.16.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/1] mm: initialize pages on demand during boot
  2018-02-08 18:45 ` [PATCH v2 1/1] mm: " Pavel Tatashin
@ 2018-02-08 20:03   ` Andrew Morton
  2018-02-08 22:27     ` Pavel Tatashin
  2018-02-10  5:24   ` kbuild test robot
  2018-02-10  5:37   ` kbuild test robot
  2 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2018-02-08 20:03 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, m.mizuma, mhocko,
	catalin.marinas, takahiro.akashi, gi-oh.kim, heiko.carstens,
	baiyaowei, richard.weiyang, paul.burton, miles.chen, vbabka,
	mgorman, hannes, linux-kernel, linux-mm

On Thu,  8 Feb 2018 13:45:55 -0500 Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> Deferred page initialization allows the boot cpu to initialize a small
> subset of the system's pages early in boot, with other cpus doing the rest
> later on.
> 
> It is, however, problematic to know how many pages the kernel needs during
> boot.  Different modules and kernel parameters may change the requirement,
> so the boot cpu either initializes too many pages or runs out of memory.
> 
> To fix that, initialize early pages on demand.  This ensures the kernel
> does the minimum amount of work to initialize pages during boot and leaves
> the rest to be divided in the multithreaded initialization path
> (deferred_init_memmap).
> 
> The on-demand code is permanently disabled using static branching once
> deferred pages are initialized.  After the static branch is changed to
> false, the overhead is up-to two branch-always instructions if the zone
> watermark check fails or if rmqueue fails.
>
> ...
>
> @@ -1604,6 +1566,96 @@ static int __init deferred_init_memmap(void *data)
>  	pgdat_init_report_one_done();
>  	return 0;
>  }
> +
> +/*
> + * Protects some early interrupt threads, and also for a short period of time
> + * from  smp_init() to page_alloc_init_late() when deferred pages are
> + * initialized.
> + */
> +static __initdata DEFINE_SPINLOCK(deferred_zone_grow_lock);

Comment is a little confusing.  Locks don't protect "threads" - they
protect data.  Can we be specific about which data is being protected?

Why is a new lock needed here?  Those data structures already have
designated locks, don't they?

If the lock protects "early interrupt threads" then it's surprising to
see it taken with spin_lock() and not spin_lock_irqsave()?

> +DEFINE_STATIC_KEY_TRUE(deferred_pages);
> +
> +/*
> + * If this zone has deferred pages, try to grow it by initializing enough
> + * deferred pages to satisfy the allocation specified by order, rounded up to
> + * the nearest PAGES_PER_SECTION boundary.  So we're adding memory in increments
> + * of SECTION_SIZE bytes by initializing struct pages in increments of
> + * PAGES_PER_SECTION * sizeof(struct page) bytes.
> + */

Please also document the return value.

> +static noinline bool __init

Why was noinline needed?

> +deferred_grow_zone(struct zone *zone, unsigned int order)
> +{
> +	int zid = zone_idx(zone);
> +	int nid = zone->node;
> +	pg_data_t *pgdat = NODE_DATA(nid);
> +	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
> +	unsigned long nr_pages = 0;
> +	unsigned long first_init_pfn, first_deferred_pfn, spfn, epfn, t;
> +	phys_addr_t spa, epa;
> +	u64 i;
> +
> +	/* Only the last zone may have deferred pages */
> +	if (zone_end_pfn(zone) != pgdat_end_pfn(pgdat))
> +		return false;
> +
> +	first_deferred_pfn = READ_ONCE(pgdat->first_deferred_pfn);

It would be nice to have a little comment explaining why READ_ONCE was
needed.

Would it still be needed if this code was moved into the locked region?

> +	first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
> +
> +	if (first_init_pfn >= pgdat_end_pfn(pgdat))
> +		return false;
> +
> +	spin_lock(&deferred_zone_grow_lock);
> +	/*
> +	 * Bail if we raced with another thread that disabled on demand
> +	 * initialization.
> +	 */
> +	if (!static_branch_unlikely(&deferred_pages)) {
> +		spin_unlock(&deferred_zone_grow_lock);
> +		return false;
> +	}
> +
> +	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> +		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> +		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> +
> +		while (spfn < epfn && nr_pages < nr_pages_needed) {
> +			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
> +			first_deferred_pfn = min(t, epfn);
> +			nr_pages += deferred_init_pages(nid, zid, spfn,
> +							first_deferred_pfn);
> +			spfn = first_deferred_pfn;
> +		}
> +
> +		if (nr_pages >= nr_pages_needed)
> +			break;
> +	}
> +
> +	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> +		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> +		epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
> +		deferred_free_pages(nid, zid, spfn, epfn);
> +
> +		if (first_deferred_pfn == epfn)
> +			break;
> +	}
> +	WRITE_ONCE(pgdat->first_deferred_pfn, first_deferred_pfn);
> +	spin_unlock(&deferred_zone_grow_lock);
> +
> +	return nr_pages >= nr_pages_needed;
> +}
> +
> +/*
> + * deferred_grow_zone() is __init, but it is called from
> + * get_page_from_freelist() during early boot until deferred_pages permanently
> + * disables this call. This is why, we have refdata wrapper to avoid warning,
> + * and ensure that the function body gets unloaded.

s/why,/why/
s/ensure/to ensure/

> + */
> +static bool __ref
> +_deferred_grow_zone(struct zone *zone, unsigned int order)
> +{
> +	return deferred_grow_zone(zone, order);
> +}
> +
>  #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
>  
>  void __init page_alloc_init_late(void)
> @@ -1613,6 +1665,14 @@ void __init page_alloc_init_late(void)
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  	int nid;
>  
> +	/*
> +	 * We are about to initialize the rest of deferred pages, permanently
> +	 * disable on-demand struct page initialization.
> +	 */
> +	spin_lock(&deferred_zone_grow_lock);
> +	static_branch_disable(&deferred_pages);
> +	spin_unlock(&deferred_zone_grow_lock);

Ah, so the new lock is to protect the static branch machinery only?

>  	/* There will be num_node_state(N_MEMORY) threads */
>  	atomic_set(&pgdat_init_n_undone, num_node_state(N_MEMORY));
>  	for_each_node_state(nid, N_MEMORY) {
>
> ...
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/1] mm: initialize pages on demand during boot
  2018-02-08 20:03   ` Andrew Morton
@ 2018-02-08 22:27     ` Pavel Tatashin
  2018-02-09  0:58       ` Pavel Tatashin
  0 siblings, 1 reply; 7+ messages in thread
From: Pavel Tatashin @ 2018-02-08 22:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: steven.sistare, daniel.m.jordan, m.mizuma, mhocko,
	catalin.marinas, takahiro.akashi, gi-oh.kim, heiko.carstens,
	baiyaowei, richard.weiyang, paul.burton, miles.chen, vbabka,
	mgorman, hannes, linux-kernel, linux-mm

Hi Andrew,

Thank you for your comments. My replies below:

>> +
>> +/*
>> + * Protects some early interrupt threads, and also for a short period of time
>> + * from  smp_init() to page_alloc_init_late() when deferred pages are
>> + * initialized.
>> + */
>> +static __initdata DEFINE_SPINLOCK(deferred_zone_grow_lock);
> 
> Comment is a little confusing.  Locks don't protect "threads" - they
> protect data.  Can we be specific about which data is being protected?

I will update the comment, explaining that this lock protects 
first_deferred_pfn in all zones. The lock is discarded after boot, hence 
it is in __initdata.

> 
> Why is a new lock needed here?  Those data structures already have
> designated locks, don't they?

No, there is no lock for this particular purpose. Before this commit, 
first_deferred_pfn was only updated early in boot before the boot thread 
can be preempted. And, only once after all pages are initialized to mark 
that there are no deferred pages anymore. With this commit, the number 
of deferred pages can change later in boot, this is why we need this new 
lock, but for a relatively short period of boot time.

> 
> If the lock protects "early interrupt threads" then it's surprising to
> see it taken with spin_lock() and not spin_lock_irqsave()?

Yes, I will update the code to use spin_lock_irqsave(), thank you.

> 
>> +DEFINE_STATIC_KEY_TRUE(deferred_pages);
>> +
>> +/*
>> + * If this zone has deferred pages, try to grow it by initializing enough
>> + * deferred pages to satisfy the allocation specified by order, rounded up to
>> + * the nearest PAGES_PER_SECTION boundary.  So we're adding memory in increments
>> + * of SECTION_SIZE bytes by initializing struct pages in increments of
>> + * PAGES_PER_SECTION * sizeof(struct page) bytes.
>> + */
> 
> Please also document the return value.
> 
>> +static noinline bool __init
> 
> Why was noinline needed?

To save space after boot. We want the body of this function to be 
unloaded after the boot when __init pages are unmapped, and we do not 
want this function to be inlined into : _deferred_grow_zone() which is 
__ref function.

> 
>> +deferred_grow_zone(struct zone *zone, unsigned int order)
>> +{
>> +	int zid = zone_idx(zone);
>> +	int nid = zone->node;
>> +	pg_data_t *pgdat = NODE_DATA(nid);
>> +	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
>> +	unsigned long nr_pages = 0;
>> +	unsigned long first_init_pfn, first_deferred_pfn, spfn, epfn, t;
>> +	phys_addr_t spa, epa;
>> +	u64 i;
>> +
>> +	/* Only the last zone may have deferred pages */
>> +	if (zone_end_pfn(zone) != pgdat_end_pfn(pgdat))
>> +		return false;
>> +
>> +	first_deferred_pfn = READ_ONCE(pgdat->first_deferred_pfn);
> 
> It would be nice to have a little comment explaining why READ_ONCE was
> needed.
> 
> Would it still be needed if this code was moved into the locked region?

No, we would need to use READ_ONCE() if we grabbed 
deferred_zone_grow_lock before this code. In fact I do not even think we 
strictly need READ_ONCE() here, as it is a single load anyway. But, 
because we are outside of the lock, and we want to quickly fetch the 
data with a single load, I think it makes sense to emphasize it using 
READ_ONCE() without expected compiler to simply do the write thing for us.

> 
>> +	first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
>> +
>> +	if (first_init_pfn >= pgdat_end_pfn(pgdat))
>> +		return false;
>> +
>> +	spin_lock(&deferred_zone_grow_lock);
>> +	/*
>> +	 * Bail if we raced with another thread that disabled on demand
>> +	 * initialization.
>> +	 */
>> +	if (!static_branch_unlikely(&deferred_pages)) {
>> +		spin_unlock(&deferred_zone_grow_lock);
>> +		return false;
>> +	}
>> +
>> +	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
>> +		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
>> +		epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
>> +
>> +		while (spfn < epfn && nr_pages < nr_pages_needed) {
>> +			t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
>> +			first_deferred_pfn = min(t, epfn);
>> +			nr_pages += deferred_init_pages(nid, zid, spfn,
>> +							first_deferred_pfn);
>> +			spfn = first_deferred_pfn;
>> +		}
>> +
>> +		if (nr_pages >= nr_pages_needed)
>> +			break;
>> +	}
>> +
>> +	for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
>> +		spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
>> +		epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
>> +		deferred_free_pages(nid, zid, spfn, epfn);
>> +
>> +		if (first_deferred_pfn == epfn)
>> +			break;
>> +	}
>> +	WRITE_ONCE(pgdat->first_deferred_pfn, first_deferred_pfn);
>> +	spin_unlock(&deferred_zone_grow_lock);
>> +
>> +	return nr_pages >= nr_pages_needed;
>> +}
>> +
>> +/*
>> + * deferred_grow_zone() is __init, but it is called from
>> + * get_page_from_freelist() during early boot until deferred_pages permanently
>> + * disables this call. This is why, we have refdata wrapper to avoid warning,
>> + * and ensure that the function body gets unloaded.
> 
> s/why,/why/
> s/ensure/to ensure/

OK, thank you.

> 
>> + */
>> +static bool __ref
>> +_deferred_grow_zone(struct zone *zone, unsigned int order)
>> +{
>> +	return deferred_grow_zone(zone, order);
>> +}
>> +
>>   #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
>>   
>>   void __init page_alloc_init_late(void)
>> @@ -1613,6 +1665,14 @@ void __init page_alloc_init_late(void)
>>   #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>>   	int nid;
>>   
>> +	/*
>> +	 * We are about to initialize the rest of deferred pages, permanently
>> +	 * disable on-demand struct page initialization.
>> +	 */
>> +	spin_lock(&deferred_zone_grow_lock);
>> +	static_branch_disable(&deferred_pages);
>> +	spin_unlock(&deferred_zone_grow_lock);
> 
> Ah, so the new lock is to protect the static branch machinery only?

This lock is needed when several threads are trying to allocate memory 
simultaneously, and there is no enough pages in the zone to do so, but 
there are still deferred pages available.

Thank you,
Pavel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/1] mm: initialize pages on demand during boot
  2018-02-08 22:27     ` Pavel Tatashin
@ 2018-02-09  0:58       ` Pavel Tatashin
  0 siblings, 0 replies; 7+ messages in thread
From: Pavel Tatashin @ 2018-02-09  0:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Steve Sistare, Daniel Jordan, m.mizuma, Michal Hocko,
	Catalin Marinas, AKASHI Takahiro, Gioh Kim, Heiko Carstens,
	baiyaowei, Wei Yang, Paul Burton, Miles Chen, Vlastimil Babka,
	Mel Gorman, Johannes Weiner, Linux Kernel Mailing List,
	Linux Memory Management List

>>
>> It would be nice to have a little comment explaining why READ_ONCE was
>> needed.
>>
>> Would it still be needed if this code was moved into the locked region?
>
>
> No, we would need to use READ_ONCE() if we grabbed deferred_zone_grow_lock
> before this code. In fact I do not even think we strictly need READ_ONCE()
> here, as it is a single load anyway. But, because we are outside of the
> lock, and we want to quickly fetch the data with a single load, I think it
> makes sense to emphasize it using READ_ONCE() without expected compiler to
> simply do the write thing for us.
>
>

Correction:

No, we would NOT need ...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/1] mm: initialize pages on demand during boot
  2018-02-08 18:45 ` [PATCH v2 1/1] mm: " Pavel Tatashin
  2018-02-08 20:03   ` Andrew Morton
@ 2018-02-10  5:24   ` kbuild test robot
  2018-02-10  5:37   ` kbuild test robot
  2 siblings, 0 replies; 7+ messages in thread
From: kbuild test robot @ 2018-02-10  5:24 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: kbuild-all, steven.sistare, daniel.m.jordan, m.mizuma, akpm,
	mhocko, catalin.marinas, takahiro.akashi, gi-oh.kim,
	heiko.carstens, baiyaowei, richard.weiyang, paul.burton,
	miles.chen, vbabka, mgorman, hannes, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3802 bytes --]

Hi Pavel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on mmotm/master]
[also build test ERROR on next-20180209]
[cannot apply to v4.15]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Pavel-Tatashin/mm-initialize-pages-on-demand-during-boot/20180210-125104
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x017-201805 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   mm/page_alloc.c: In function 'deferred_grow_zone':
>> mm/page_alloc.c:1590:18: error: 'struct zone' has no member named 'node'; did you mean 'name'?
     int nid = zone->node;
                     ^~~~
                     name

vim +1590 mm/page_alloc.c

  1578	
  1579	/*
  1580	 * If this zone has deferred pages, try to grow it by initializing enough
  1581	 * deferred pages to satisfy the allocation specified by order, rounded up to
  1582	 * the nearest PAGES_PER_SECTION boundary.  So we're adding memory in increments
  1583	 * of SECTION_SIZE bytes by initializing struct pages in increments of
  1584	 * PAGES_PER_SECTION * sizeof(struct page) bytes.
  1585	 */
  1586	static noinline bool __init
  1587	deferred_grow_zone(struct zone *zone, unsigned int order)
  1588	{
  1589		int zid = zone_idx(zone);
> 1590		int nid = zone->node;
  1591		pg_data_t *pgdat = NODE_DATA(nid);
  1592		unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
  1593		unsigned long nr_pages = 0;
  1594		unsigned long first_init_pfn, first_deferred_pfn, spfn, epfn, t;
  1595		phys_addr_t spa, epa;
  1596		u64 i;
  1597	
  1598		/* Only the last zone may have deferred pages */
  1599		if (zone_end_pfn(zone) != pgdat_end_pfn(pgdat))
  1600			return false;
  1601	
  1602		first_deferred_pfn = READ_ONCE(pgdat->first_deferred_pfn);
  1603		first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
  1604	
  1605		if (first_init_pfn >= pgdat_end_pfn(pgdat))
  1606			return false;
  1607	
  1608		spin_lock(&deferred_zone_grow_lock);
  1609		/*
  1610		 * Bail if we raced with another thread that disabled on demand
  1611		 * initialization.
  1612		 */
  1613		if (!static_branch_unlikely(&deferred_pages)) {
  1614			spin_unlock(&deferred_zone_grow_lock);
  1615			return false;
  1616		}
  1617	
  1618		for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
  1619			spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
  1620			epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
  1621	
  1622			while (spfn < epfn && nr_pages < nr_pages_needed) {
  1623				t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
  1624				first_deferred_pfn = min(t, epfn);
  1625				nr_pages += deferred_init_pages(nid, zid, spfn,
  1626								first_deferred_pfn);
  1627				spfn = first_deferred_pfn;
  1628			}
  1629	
  1630			if (nr_pages >= nr_pages_needed)
  1631				break;
  1632		}
  1633	
  1634		for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
  1635			spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
  1636			epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
  1637			deferred_free_pages(nid, zid, spfn, epfn);
  1638	
  1639			if (first_deferred_pfn == epfn)
  1640				break;
  1641		}
  1642		WRITE_ONCE(pgdat->first_deferred_pfn, first_deferred_pfn);
  1643		spin_unlock(&deferred_zone_grow_lock);
  1644	
  1645		return nr_pages >= nr_pages_needed;
  1646	}
  1647	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30903 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/1] mm: initialize pages on demand during boot
  2018-02-08 18:45 ` [PATCH v2 1/1] mm: " Pavel Tatashin
  2018-02-08 20:03   ` Andrew Morton
  2018-02-10  5:24   ` kbuild test robot
@ 2018-02-10  5:37   ` kbuild test robot
  2 siblings, 0 replies; 7+ messages in thread
From: kbuild test robot @ 2018-02-10  5:37 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: kbuild-all, steven.sistare, daniel.m.jordan, m.mizuma, akpm,
	mhocko, catalin.marinas, takahiro.akashi, gi-oh.kim,
	heiko.carstens, baiyaowei, richard.weiyang, paul.burton,
	miles.chen, vbabka, mgorman, hannes, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3117 bytes --]

Hi Pavel,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on mmotm/master]
[also build test WARNING on next-20180209]
[cannot apply to v4.15]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Pavel-Tatashin/mm-initialize-pages-on-demand-during-boot/20180210-125104
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x018-201805 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bug.h:18:0,
                    from arch/x86/include/asm/bug.h:82,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from mm/page_alloc.c:18:
   mm/page_alloc.c: In function 'free_area_init_node':
   include/linux/kernel.h:792:16: warning: comparison of distinct pointer types lacks a cast
     (void) (&min1 == &min2);   \
                   ^
   include/linux/kernel.h:801:2: note: in expansion of macro '__min'
     __min(typeof(x), typeof(y),   \
     ^~~~~
>> mm/page_alloc.c:6357:29: note: in expansion of macro 'min'
     pgdat->static_init_pgcnt = min(PAGES_PER_SECTION,
                                ^~~

vim +/min +6357 mm/page_alloc.c

  6325	
  6326	void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
  6327			unsigned long node_start_pfn, unsigned long *zholes_size)
  6328	{
  6329		pg_data_t *pgdat = NODE_DATA(nid);
  6330		unsigned long start_pfn = 0;
  6331		unsigned long end_pfn = 0;
  6332	
  6333		/* pg_data_t should be reset to zero when it's allocated */
  6334		WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
  6335	
  6336		pgdat->node_id = nid;
  6337		pgdat->node_start_pfn = node_start_pfn;
  6338		pgdat->per_cpu_nodestats = NULL;
  6339	#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
  6340		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
  6341		pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
  6342			(u64)start_pfn << PAGE_SHIFT,
  6343			end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0);
  6344	#else
  6345		start_pfn = node_start_pfn;
  6346	#endif
  6347		calculate_node_totalpages(pgdat, start_pfn, end_pfn,
  6348					  zones_size, zholes_size);
  6349	
  6350		alloc_node_mem_map(pgdat);
  6351	
  6352	#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
  6353		/*
  6354		 * We start only with one section of pages, more pages are added as
  6355		 * needed until the rest of deferred pages are initialized.
  6356		 */
> 6357		pgdat->static_init_pgcnt = min(PAGES_PER_SECTION,
  6358					       pgdat->node_spanned_pages);
  6359		pgdat->first_deferred_pfn = ULONG_MAX;
  6360	#endif
  6361		free_area_init_core(pgdat);
  6362	}
  6363	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26927 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-02-10  5:37 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-02-08 18:45 [PATCH v2 0/1] initialize pages on demand during boot Pavel Tatashin
2018-02-08 18:45 ` [PATCH v2 1/1] mm: " Pavel Tatashin
2018-02-08 20:03   ` Andrew Morton
2018-02-08 22:27     ` Pavel Tatashin
2018-02-09  0:58       ` Pavel Tatashin
2018-02-10  5:24   ` kbuild test robot
2018-02-10  5:37   ` kbuild test robot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).