* [PATCH v5] Spread boot time scrubbing across available CPUs.

From: Konrad Rzeszutek Wilk @ 2014-06-04 13:29 UTC
To: xen-devel, JBeulich, andrew.cooper3, tim, dario.faggioli, julien.grall

Please see v5 of this patchset. It should have all review comments addressed.

I did change the algorithm for the NUMA-node-but-no-CPUs code. It now picks
the CPUs of the closest NUMA node to do the scrubbing. If that node does not
have any CPUs either, it keeps looking until it finds one - or it falls back
on the first node. And if the first node has no CPUs either, it just picks
the BSP and calls it a day. That should take care of running on broken
hardware.

I've also cross-compiled it on ARM but have not yet run the emulator to make
sure it works right. I figured I would do that once the x86 folks are
comfortable with the patch.

Thank you everybody for reviewing the patch over and over.

 docs/misc/xen-command-line.markdown |  10 ++
 xen/common/page_alloc.c             | 208 ++++++++++++++++++++++++++++++++----
 xen/include/asm-arm/numa.h          |   1 +
 3 files changed, 201 insertions(+), 18 deletions(-)
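For readers skimming the cover letter: the fallback order described above maps
onto the round #2 logic in the patch below. A condensed sketch of that order
(the helper name pick_scrub_cpus is made up for illustration; find_non_smt,
__node_distance and the cpumask helpers are the ones the patch itself uses):

    /*
     * Illustrative sketch only: prefer the closest online node that has
     * CPUs, then the first online node, and finally the BSP alone.
     */
    static unsigned int __init pick_scrub_cpus(unsigned int node, cpumask_t *dest)
    {
        unsigned int j, best_node = first_node(node_online_map);
        int distance, last_distance = INT_MAX;

        for_each_online_node ( j )
        {
            /* Skip ourselves and any node without usable CPUs. */
            if ( j == node || cpumask_empty(&node_to_cpumask(j)) )
                continue;
            distance = __node_distance(node, j);
            if ( distance < last_distance )
            {
                last_distance = distance;
                best_node = j;
            }
        }

        /* Non-SMT CPUs of the chosen node; if even that node is CPU-less,
         * scrub on the BSP alone (the "broken hardware" case above). */
        if ( find_non_smt(best_node, dest) == 0 )
            cpumask_set_cpu(smp_processor_id(), dest);

        return cpumask_weight(dest);
    }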
* [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)

From: Konrad Rzeszutek Wilk @ 2014-06-04 13:29 UTC
To: xen-devel, JBeulich, andrew.cooper3, tim, dario.faggioli, julien.grall
Cc: Malcolm Crossley, Konrad Rzeszutek Wilk

From: Malcolm Crossley <malcolm.crossley@citrix.com>

The page scrubbing is done in 128MB chunks in lockstep across all the
non-SMT CPUs. This allows the boot CPU to hold the heap_lock whilst each
chunk is being scrubbed, and then release the heap_lock when all the CPUs
have finished scrubbing their individual chunk. This way the heap_lock is
not held continuously and pending softirqs are serviced periodically
across the CPUs.

The page scrub memory chunks are allocated to the CPUs in a NUMA-aware
fashion to reduce socket interconnect overhead and improve performance.
Specifically, in the first phase we scrub at the same time on all the
NUMA nodes that have CPUs - we also weed out the SMT threads so that
we only use cores (that gives a 50% boost). The second phase is for NUMA
nodes that have no CPUs - for those we use the CPUs of the closest NUMA
node (non-SMT again) to do the job.

This patch reduces the boot page scrub time on a 128GB, 64-core AMD
Opteron 6386 machine from 49 seconds to 3 seconds. On an IvyBridge-EX
8-socket box with 1.5TB it cuts it down from 15 minutes to 63 seconds.

v2:
 - Reduced default chunk size to 128MB
 - Added code to scrub NUMA nodes with no active CPU linked to them
 - Be robust to boot CPU not being linked to a NUMA node

v3:
 - Don't use SMT threads
 - Take care of remainder if the number of CPUs (or memory) is odd
 - Restructure the worker thread
 - s/u64/unsigned long/

v4:
 - Don't use all CPUs on non-CPU NUMA nodes, just the closest one
 - Syntax and docs updates
 - Compile on ARM
 - Fix bug when NUMA node has 0 pages

v5:
 - Properly figure out the best NUMA node.
 - Fix comments to be proper style.

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 docs/misc/xen-command-line.markdown |  10 ++
 xen/common/page_alloc.c             | 208 ++++++++++++++++++++++++++++++++----
 xen/include/asm-arm/numa.h          |   1 +
 3 files changed, 201 insertions(+), 18 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 514c7b2..39a67be 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -198,6 +198,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
 accidentally leaking sensitive VM data into other VMs if Xen crashes
 and reboots.
 
+### bootscrub_chunk_
+> `= <size>`
+
+> Default: `128MB`
+
+Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock
+and not running softirqs. Reduce this if softirqs are not being run frequently
+enough. Setting this to a high value may cause cause boot failure, particularly
+if the NMI watchdog is also enabled.
+
 ### cachesize
 > `= <size>`

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 601319c..d7390b6 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -65,6 +65,13 @@ static bool_t opt_bootscrub __initdata = 1;
 boolean_param("bootscrub", opt_bootscrub);
 
 /*
+ * bootscrub_chunk -> Amount of bytes to scrub lockstep on non-SMT CPUs
+ * on all NUMA nodes.
+ */
+static unsigned long __initdata opt_bootscrub_chunk = MB(128);
+size_param("bootscrub_chunk", opt_bootscrub_chunk);
+
+/*
  * Bit width of the DMA heap -- used to override NUMA-node-first.
  * allocation strategy, which can otherwise exhaust low memory.
  */
@@ -90,6 +97,16 @@ static struct bootmem_region {
 } *__initdata bootmem_region_list;
 static unsigned int __initdata nr_bootmem_regions;
 
+struct scrub_region {
+    unsigned long offset;
+    unsigned long start;
+    unsigned long per_cpu_sz;
+    unsigned long rem;
+    cpumask_t cpus;
+};
+static struct scrub_region __initdata region[MAX_NUMNODES];
+static unsigned long __initdata chunk_size;
+
 static void __init boot_bug(int line)
 {
     panic("Boot BUG at %s:%d", __FILE__, line);
@@ -1256,42 +1273,197 @@ void __init end_boot_allocator(void)
     printk("\n");
 }
 
+static void __init smp_scrub_heap_pages(void *data)
+{
+    unsigned long mfn, start, end;
+    struct page_info *pg;
+    struct scrub_region *r;
+    unsigned int temp_cpu, node, cpu_idx = 0;
+    unsigned int cpu = smp_processor_id();
+
+    if ( data )
+        r = data;
+    else
+    {
+        node = cpu_to_node(cpu);
+        if ( node == NUMA_NO_NODE )
+            return;
+        r = &region[node];
+    }
+
+    /* Determine the current CPU's index into CPU's linked to this node. */
+    for_each_cpu ( temp_cpu, &r->cpus )
+    {
+        if ( cpu == temp_cpu )
+            break;
+        cpu_idx++;
+    }
+
+    /* Calculate the starting mfn for this CPU's memory block. */
+    start = r->start + (r->per_cpu_sz * cpu_idx) + r->offset;
+
+    /* Calculate the end mfn into this CPU's memory block for this iteration. */
+    if ( r->offset + chunk_size >= r->per_cpu_sz )
+    {
+        end = r->start + (r->per_cpu_sz * cpu_idx) + r->per_cpu_sz;
+
+        if ( r->rem && ((cpumask_weight(&r->cpus) - 1 == cpu_idx )) )
+            end += r->rem;
+    }
+    else
+        end = start + chunk_size;
+
+    for ( mfn = start; mfn < end; mfn++ )
+    {
+        pg = mfn_to_page(mfn);
+
+        /* Check the mfn is valid and page is free. */
+        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
+            continue;
+
+        scrub_one_page(pg);
+    }
+}
+
+static int __init find_non_smt(unsigned int node, cpumask_t *dest)
+{
+    cpumask_t node_cpus;
+    unsigned int i, cpu;
+
+    cpumask_and(&node_cpus, &node_to_cpumask(node), &cpu_online_map);
+    cpumask_clear(dest);
+    for_each_cpu ( i, &node_cpus )
+    {
+        if ( cpumask_intersects(dest, per_cpu(cpu_sibling_mask, i)) )
+            continue;
+        cpu = cpumask_first(per_cpu(cpu_sibling_mask, i));
+        cpumask_set_cpu(cpu, dest);
+    }
+    return cpumask_weight(dest);
+}
+
 /*
- * Scrub all unallocated pages in all heap zones. This function is more
- * convoluted than appears necessary because we do not want to continuously
- * hold the lock while scrubbing very large memory areas.
+ * Scrub all unallocated pages in all heap zones. This function uses all
+ * online cpu's to scrub the memory in parallel.
 */
 void __init scrub_heap_pages(void)
 {
-    unsigned long mfn;
-    struct page_info *pg;
+    cpumask_t node_cpus, all_worker_cpus;
+    unsigned int i, j;
+    unsigned long offset, max_per_cpu_sz = 0;
+    unsigned long start, end;
+    unsigned long rem = 0;
+    int last_distance, best_node;
 
     if ( !opt_bootscrub )
         return;
 
-    printk("Scrubbing Free RAM: ");
+    cpumask_clear(&all_worker_cpus);
+    /* Scrub block size. */
+    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
+    if ( chunk_size == 0 )
+        chunk_size = MB(128);
+
+    /* Round #0 - figure out amounts and which CPUs to use. */
+    for_each_online_node ( i )
+    {
+        if ( !node_spanned_pages(i) )
+            continue;
+        /* Calculate Node memory start and end address. */
+        start = max(node_start_pfn(i), first_valid_mfn);
+        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
+        /* CPUs that are online and on this node (if none, that it is OK). */
+        find_non_smt(i, &node_cpus);
+        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
+        if ( cpumask_empty(&node_cpus) )
+        {
+            /* No CPUs on this node. Round #2 will take of it. */
+            rem = 0;
+            region[i].per_cpu_sz = (end - start);
+        } else {
+            rem = (end - start) % cpumask_weight(&node_cpus);
+            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
+            if ( region[i].per_cpu_sz > max_per_cpu_sz )
+                max_per_cpu_sz = region[i].per_cpu_sz;
+        }
+        region[i].start = start;
+        region[i].rem = rem;
+        cpumask_copy(&region[i].cpus, &node_cpus);
+
+    }
+
+    printk("Scrubbing Free RAM on %u nodes using %u CPUs\n", num_online_nodes(),
+           cpumask_weight(&all_worker_cpus));
 
-    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ )
+    /* Round: #1 - do NUMA nodes with CPUs. */
+    for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
     {
+        for_each_online_node ( i )
+            region[i].offset = offset;
+
         process_pending_softirqs();
 
-        pg = mfn_to_page(mfn);
+        spin_lock(&heap_lock);
+        on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, NULL, 1);
+        spin_unlock(&heap_lock);
 
-        /* Quick lock-free check. */
-        if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
+        printk(".");
+    }
+
+    /*
+     * Round #2: NUMA nodes with no CPUs get scrubbed with CPUs on the node
+     * closest to us and with CPUs.
+     */
+    for_each_online_node ( i )
+    {
+        node_cpus = node_to_cpumask(i);
+
+        if ( !cpumask_empty(&node_cpus) )
             continue;
 
-        /* Every 100MB, print a progress dot. */
-        if ( (mfn % ((100*1024*1024)/PAGE_SIZE)) == 0 )
-            printk(".");
+        last_distance = INT_MAX;
+        best_node = first_node(node_online_map);
+        /* Figure out which NODE CPUs are close. */
+        for_each_online_node ( j )
+        {
+            int distance;
 
-        spin_lock(&heap_lock);
+            if ( i == j )
+                continue;
+            distance = __node_distance(i, j);
+            if ( distance < last_distance )
+            {
+                if ( cpumask_empty(&node_to_cpumask(j)) )
+                    continue;
+                last_distance = distance;
+                best_node = j;
+            }
+        }
+        /*
+         * Use CPUs from best node, and if there are no CPUs on the
+         * first node (the default) use the BSP.
+         */
+        if ( find_non_smt(best_node, &node_cpus) == 0 )
+            cpumask_set_cpu(smp_processor_id(), &node_cpus);
 
-        /* Re-check page status with lock held. */
-        if ( page_state_is(pg, free) )
-            scrub_one_page(pg);
+        /* We already have the node information from round #0. */
+        region[i].rem = region[i].per_cpu_sz % cpumask_weight(&node_cpus);
+        region[i].per_cpu_sz /= cpumask_weight(&node_cpus);
+        max_per_cpu_sz = region[i].per_cpu_sz;
+        cpumask_copy(&region[i].cpus, &node_cpus);
 
-        spin_unlock(&heap_lock);
+        for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
+        {
+            region[i].offset = offset;
+
+            process_pending_softirqs();
+
+            spin_lock(&heap_lock);
+            on_selected_cpus(&node_cpus, smp_scrub_heap_pages, &region[i], 1);
+            spin_unlock(&heap_lock);
+
+            printk(".");
+        }
     }
 
     printk("done.\n");

diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
index cb8f2ba..2c019d7 100644
--- a/xen/include/asm-arm/numa.h
+++ b/xen/include/asm-arm/numa.h
@@ -12,6 +12,7 @@ static inline __attribute__((pure)) int phys_to_nid(paddr_t addr)
 
 /* XXX: implement NUMA support */
 #define node_spanned_pages(nid) (total_pages)
+#define node_start_pfn(nid) (frametable_base_mfn)
 #define __node_distance(a, b) (20)
 
 #endif /* __ARCH_ARM_NUMA_H */
-- 
1.9.3
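To see how each worker carves out its scrub window in smp_scrub_heap_pages(),
here is the same arithmetic restated with made-up example numbers (a node
spanning 1,000,000 pages, 4 non-SMT workers, 4KB pages); it mirrors the patch
rather than adding behaviour:

    /*
     * Example: per_cpu_sz = 1,000,000 / 4 = 250,000 pages, rem = 0, and
     * chunk_size = 32768 pages (128MB with 4KB pages). Each call scrubs one
     * 32768-page slice of this worker's block; the last worker on the node
     * also takes any remainder pages.
     */
    start = r->start + r->per_cpu_sz * cpu_idx + r->offset;
    if ( r->offset + chunk_size >= r->per_cpu_sz )       /* last slice of the block */
    {
        end = r->start + r->per_cpu_sz * (cpu_idx + 1);
        if ( r->rem && cpu_idx == cpumask_weight(&r->cpus) - 1 )
            end += r->rem;                               /* leftover pages go to the last CPU */
    }
    else
        end = start + chunk_size;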
* Re: [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)

From: Andrew Cooper @ 2014-06-04 13:35 UTC
To: Konrad Rzeszutek Wilk
Cc: dario.faggioli, julien.grall, tim, JBeulich, xen-devel, Malcolm Crossley

On 04/06/14 14:29, Konrad Rzeszutek Wilk wrote:
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index 514c7b2..39a67be 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -198,6 +198,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
>  accidentally leaking sensitive VM data into other VMs if Xen crashes
>  and reboots.
> 
> +### bootscrub_chunk_
> +> `= <size>`
> +
> +> Default: `128MB`

The 'B' here is still erroneous.

>      printk("done.\n");
> diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> index cb8f2ba..2c019d7 100644
> --- a/xen/include/asm-arm/numa.h
> +++ b/xen/include/asm-arm/numa.h
> @@ -12,6 +12,7 @@ static inline __attribute__((pure)) int phys_to_nid(paddr_t addr)
> 
>  /* XXX: implement NUMA support */
>  #define node_spanned_pages(nid) (total_pages)
> +#define node_start_pfn(nid) (frametable_base_mfn)

What is this supposed to achieve?

~Andrew

>  #define __node_distance(a, b) (20)
> 
>  #endif /* __ARCH_ARM_NUMA_H */
* Re: [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)

From: Konrad Rzeszutek Wilk @ 2014-06-04 13:52 UTC
To: Andrew Cooper
Cc: dario.faggioli, julien.grall, tim, JBeulich, xen-devel, Malcolm Crossley

On Wed, Jun 04, 2014 at 02:35:49PM +0100, Andrew Cooper wrote:
> On 04/06/14 14:29, Konrad Rzeszutek Wilk wrote:
> > diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> > index 514c7b2..39a67be 100644
> > --- a/docs/misc/xen-command-line.markdown
> > +++ b/docs/misc/xen-command-line.markdown
> > @@ -198,6 +198,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
> >  accidentally leaking sensitive VM data into other VMs if Xen crashes
> >  and reboots.
> > 
> > +### bootscrub_chunk_
> > +> `= <size>`
> > +
> > +> Default: `128MB`
> 
> The 'B' here is still erroneous.
> 
> >      printk("done.\n");
> > diff --git a/xen/include/asm-arm/numa.h b/xen/include/asm-arm/numa.h
> > index cb8f2ba..2c019d7 100644
> > --- a/xen/include/asm-arm/numa.h
> > +++ b/xen/include/asm-arm/numa.h
> > @@ -12,6 +12,7 @@ static inline __attribute__((pure)) int phys_to_nid(paddr_t addr)
> > 
> >  /* XXX: implement NUMA support */
> >  #define node_spanned_pages(nid) (total_pages)
> > +#define node_start_pfn(nid) (frametable_base_mfn)
> 
> What is this supposed to achieve?

Make it scrub memory on ARM (which defines a flat, single-node NUMA
topology). Perhaps I am not understanding your question?

> 
> ~Andrew
> 
> >  #define __node_distance(a, b) (20)
> > 
> >  #endif /* __ARCH_ARM_NUMA_H */
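In other words, with the ARM stubs quoted above the generic round #0 loop
degenerates to a single region covering all of RAM. A rough sketch of what the
two macros feed into (illustrative only, not code from the patch; it assumes
the stub definitions behave as quoted):

    /*
     * On ARM: node_start_pfn(0) == frametable_base_mfn and
     * node_spanned_pages(0) == total_pages, so round #0 sees exactly one
     * node whose range is simply clamped to the valid MFN range:
     */
    start = max(node_start_pfn(0), first_valid_mfn);
    end   = min(node_start_pfn(0) + node_spanned_pages(0), max_page);
    /* ... and every online (non-SMT) CPU lands in region[0].cpus. */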
* Re: [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)

From: Tim Deegan @ 2014-06-05 10:09 UTC
To: Konrad Rzeszutek Wilk
Cc: andrew.cooper3, dario.faggioli, julien.grall, JBeulich, xen-devel, Malcolm Crossley

Hi,

At 09:29 -0400 on 04 Jun (1401870576), Konrad Rzeszutek Wilk wrote:
> +    /* Scrub block size. */
> +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> +    if ( chunk_size == 0 )
> +        chunk_size = MB(128);

ITYM (MB(128) >> PAGE_SHIFT) here!

Otherwise, Reviewed-by: Tim Deegan <tim@xen.org>

Cheers,

Tim.
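For clarity, the point is that chunk_size is kept in pages, so the fallback
must be converted the same way as the command-line value. A minimal sketch of
the fix Tim suggests, assuming the surrounding code stays as in v5:

    /* chunk_size is a page count; convert the 128MB default accordingly. */
    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
    if ( chunk_size == 0 )
        chunk_size = MB(128) >> PAGE_SHIFT;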
* Re: [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)

From: Jan Beulich @ 2014-06-05 11:22 UTC
To: Konrad Rzeszutek Wilk
Cc: andrew.cooper3, dario.faggioli, julien.grall, tim, xen-devel, Malcolm Crossley

>>> On 04.06.14 at 15:29, <konrad.wilk@oracle.com> wrote:
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -198,6 +198,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent
>  accidentally leaking sensitive VM data into other VMs if Xen crashes
>  and reboots.
> 
> +### bootscrub_chunk_

Looking at other examples in that file, underscores appear to need
backslash escaping here. And I don't think the trailing one should
be there.

> +> `= <size>`
> +
> +> Default: `128MB`
> +
> +Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock
> +and not running softirqs. Reduce this if softirqs are not being run frequently
> +enough. Setting this to a high value may cause cause boot failure, particularly

Double "cause".

> +static void __init smp_scrub_heap_pages(void *data)
> +{
> [...]
> +static int __init find_non_smt(unsigned int node, cpumask_t *dest)
> +{
> [...]
>  /*
> - * Scrub all unallocated pages in all heap zones. This function is more
> - * convoluted than appears necessary because we do not want to continuously
> - * hold the lock while scrubbing very large memory areas.
> + * Scrub all unallocated pages in all heap zones. This function uses all
> + * online cpu's to scrub the memory in parallel.
>  */
>  void __init scrub_heap_pages(void)
>  {
> -    unsigned long mfn;
> -    struct page_info *pg;
> +    cpumask_t node_cpus, all_worker_cpus;
> +    unsigned int i, j;
> +    unsigned long offset, max_per_cpu_sz = 0;
> +    unsigned long start, end;
> +    unsigned long rem = 0;
> +    int last_distance, best_node;
> 
>      if ( !opt_bootscrub )
>          return;
> 
> -    printk("Scrubbing Free RAM: ");
> +    cpumask_clear(&all_worker_cpus);
> +    /* Scrub block size. */
> +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> +    if ( chunk_size == 0 )
> +        chunk_size = MB(128);
> +
> +    /* Round #0 - figure out amounts and which CPUs to use. */
> +    for_each_online_node ( i )
> +    {
> +        if ( !node_spanned_pages(i) )
> +            continue;
> +        /* Calculate Node memory start and end address. */
> +        start = max(node_start_pfn(i), first_valid_mfn);

This implies you're aware that possibly node_start_pfn(i) <
first_valid_mfn.

> +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);

Which in turn means that this may yield end < start. Is all the rest
of the code prepared to deal with this? At least the divide and
modulo operations on the difference further down doesn't look like
so.

> +        /* CPUs that are online and on this node (if none, that it is OK). */
> +        find_non_smt(i, &node_cpus);

Here you could latch the weight, avoiding ...

> +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> +        if ( cpumask_empty(&node_cpus) )

... the extra operation on the mask here and the explicit use of
cpumask_weight() on node_cpus in the else path.

> +        {
> +            /* No CPUs on this node. Round #2 will take of it. */
> +            rem = 0;
> +            region[i].per_cpu_sz = (end - start);
> +        } else {

Coding style - this takes 3 lines.

> +            rem = (end - start) % cpumask_weight(&node_cpus);
> +            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
> [...]
> +        last_distance = INT_MAX;
> +        best_node = first_node(node_online_map);
> +        /* Figure out which NODE CPUs are close. */
> +        for_each_online_node ( j )
> +        {
> +            int distance;
> 
> -        spin_lock(&heap_lock);
> +            if ( i == j )
> +                continue;

This could be replaced with cpumask_empty(&node_to_cpumask(j)),
allowing ...

> +            distance = __node_distance(i, j);
> +            if ( distance < last_distance )
> +            {
> +                if ( cpumask_empty(&node_to_cpumask(j)) )
> +                    continue;

this check to be dropped.
Jan
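Taken together, Jan's two code suggestions would reshape round #0 and the
round #2 distance scan roughly as follows. This is an illustrative sketch of
the suggested direction only (not the committed code); it assumes the same
locals and helpers as scrub_heap_pages() in the patch:

    /* Round #0 with the CPU count latched from find_non_smt(). */
    for_each_online_node ( i )
    {
        unsigned int cpus;

        if ( !node_spanned_pages(i) )
            continue;
        start = max(node_start_pfn(i), first_valid_mfn);
        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);

        cpus = find_non_smt(i, &node_cpus);
        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
        if ( cpus == 0 )
        {
            /* No CPUs on this node; round #2 will take care of it. */
            rem = 0;
            region[i].per_cpu_sz = end - start;
        }
        else
        {
            rem = (end - start) % cpus;
            region[i].per_cpu_sz = (end - start) / cpus;
            if ( region[i].per_cpu_sz > max_per_cpu_sz )
                max_per_cpu_sz = region[i].per_cpu_sz;
        }
        region[i].start = start;
        region[i].rem = rem;
        cpumask_copy(&region[i].cpus, &node_cpus);
    }

    /*
     * Round #2 distance scan with the emptiness check hoisted: since node i
     * itself has no CPUs, the check also covers the old "i == j" test.
     */
    for_each_online_node ( j )
    {
        if ( cpumask_empty(&node_to_cpumask(j)) )
            continue;
        distance = __node_distance(i, j);
        if ( distance < last_distance )
        {
            last_distance = distance;
            best_node = j;
        }
    }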
* Re: [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v5)

From: Konrad Rzeszutek Wilk @ 2014-06-05 17:49 UTC
To: Jan Beulich
Cc: andrew.cooper3, dario.faggioli, julien.grall, tim, xen-devel, Malcolm Crossley

On Thu, Jun 05, 2014 at 12:22:41PM +0100, Jan Beulich wrote:
> >>> On 04.06.14 at 15:29, <konrad.wilk@oracle.com> wrote:
> > --- a/docs/misc/xen-command-line.markdown
> > +++ b/docs/misc/xen-command-line.markdown
> > +### bootscrub_chunk_
> 
> Looking at other examples in that file, underscores appear to need
> backslash escaping here. And I don't think the trailing one should
> be there.
> 
> > +> `= <size>`
> > +
> > +> Default: `128MB`
> > +
> > +Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock
> > +and not running softirqs. Reduce this if softirqs are not being run frequently
> > +enough. Setting this to a high value may cause cause boot failure, particularly
> 
> Double "cause".
> 
> > [...]
> >  */
> >  void __init scrub_heap_pages(void)
> >  {
> > [...]
> > +    /* Round #0 - figure out amounts and which CPUs to use. */
> > +    for_each_online_node ( i )
> > +    {
> > +        if ( !node_spanned_pages(i) )
> > +            continue;
> > +        /* Calculate Node memory start and end address. */
> > +        start = max(node_start_pfn(i), first_valid_mfn);
> 
> This implies you're aware that possibly node_start_pfn(i) <
> first_valid_mfn.
> 
> > +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
> 
> Which in turn means that this may yield end < start. Is all the rest
> of the code prepared to deal with this? At least the divide and
> modulo operations on the difference further down doesn't look like
> so.

It will loop forever. I think adding

    end = max(end, start);

would fix it.

> > +        /* CPUs that are online and on this node (if none, that it is OK). */
> > +        find_non_smt(i, &node_cpus);
> 
> Here you could latch the weight, avoiding ...
> 
> > +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> > +        if ( cpumask_empty(&node_cpus) )
> 
> ... the extra operation on the mask here and the explicit use of
> cpumask_weight() on node_cpus in the else path.

<nods>

> > +        {
> > +            /* No CPUs on this node. Round #2 will take of it. */
> > +            rem = 0;
> > +            region[i].per_cpu_sz = (end - start);
> > +        } else {
> 
> Coding style - this takes 3 lines.
> 
> > +            rem = (end - start) % cpumask_weight(&node_cpus);
> > +            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
> > [...]
> > +        last_distance = INT_MAX;
> > +        best_node = first_node(node_online_map);
> > +        /* Figure out which NODE CPUs are close. */
> > +        for_each_online_node ( j )
> > +        {
> > +            int distance;
> > 
> > -        spin_lock(&heap_lock);
> > +            if ( i == j )
> > +                continue;
> 
> This could be replaced with cpumask_empty(&node_to_cpumask(j)),
> allowing ...
> 
> > +            distance = __node_distance(i, j);
> > +            if ( distance < last_distance )
> > +            {
> > +                if ( cpumask_empty(&node_to_cpumask(j)) )
> > +                    continue;
> 
> this check to be dropped.

Clever! Will do.

> 
> Jan
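A sketch of the clamp Konrad proposes for round #0 (illustrative only; the
thread does not show which exact form went into the next revision):

    start = max(node_start_pfn(i), first_valid_mfn);
    end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
    /*
     * Guard against end < start: with unsigned arithmetic the difference
     * would otherwise wrap to a huge value, and the divide/modulo below
     * would produce an enormous per_cpu_sz, i.e. a near-endless scrub loop.
     */
    end = max(end, start);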