All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] Xen: Spread boot time page scrubbing across all available CPUs
@ 2014-04-11 18:08 Konrad Rzeszutek Wilk
  2014-04-11 18:08 ` [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-11 18:08 UTC (permalink / raw)
  To: dario.faggioli, tim, keir, andrew.cooper3, xen-devel, JBeulich,
	malcolm.crossley

Hey,

Please see v3 of this patchset. It should have all review comments addressed.
I did change the algorithm for the NUMA-node-but-no-CPUs code. Instead of
just using one CPU (the bootstrap) it spreads the work across all of the
CPUs. The first pass - retains the same algorithm.

 docs/misc/xen-command-line.markdown |   10 ++
 xen/common/page_alloc.c             |  177 +++++++++++++++++++++++++++++++----
 
Malcolm Crossley (1):
      Xen: Spread boot time page scrubbing across all available CPU's

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's
@ 2014-04-12 10:41 Konrad Rzeszutek Wilk
  2014-04-13 22:45 ` Tim Deegan
  0 siblings, 1 reply; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-12 10:41 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: keir, dario.faggioli, tim, xen-devel, Konrad Rzeszutek Wilk,
	JBeulich, malcolm.crossley


On Apr 11, 2014 2:38 PM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 11/04/14 19:08, Konrad Rzeszutek Wilk wrote: 
> > From: Malcolm Crossley <malcolm.crossley@citrix.com> 
> > 
> > The page scrubbing is done in 128MB chunks in lockstep across all the CPU's. 
>
> No apostrophe in plural CPUs (also further through) 
>
> > This allows for the boot CPU to hold the heap_lock whilst each chunk is being 
> > scrubbed and then release the heap_lock when all CPU's are finished scrubing 
> > their individual chunk. This allows for the heap_lock to not be held 
> > continously and for pending softirqs are to be serviced periodically across 
> > all CPU's. 
> > 
> > The page scrub memory chunks are allocated to the CPU's in a NUMA aware 
> > fashion to reduce socket interconnect overhead and improve performance. 
> > Specifically in the first phase we scrub at the same time on all the 
> > NUMA nodes that have CPUs - we also weed out the SMT threads so that 
> > we only use cores (that gives a 50% boost). The second phase is to use 
> > all of the CPUs for the NUMA nodes that have no CPUs. 
> > 
> > This patch reduces the boot page scrub time on a 128GB 64 core AMD Opteron 
> > 6386 machine from 49 seconds to 3 seconds. 
> > On a IvyBridge-EX 8 socket box with 1.5TB it cuts it down from 15 minutes 
> > to 117 seconds. 
>
> The numbers from our 1TB box are also along a similar line, but I don't 
> have them to hand. 
>
> > 
> > v2 
> >  - Reduced default chunk size to 128MB 
> >  - Added code to scrub NUMA nodes with no active CPU linked to them 
> >  - Be robust to boot CPU not being linked to a NUMA node 
> > 
> > v3: 
> >  - Don't use SMT threads 
> >  - Take care of remainder if the number of CPUs (or memory) is odd 
> >  - Restructure the worker thread 
> >  - s/u64/unsigned long/ 
> > 
> > Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com> 
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 
> > --- 
> >  docs/misc/xen-command-line.markdown |   10 ++ 
> >  xen/common/page_alloc.c             |  177 +++++++++++++++++++++++++++++++---- 
> >  2 files changed, 167 insertions(+), 20 deletions(-) 
> > 
> > diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown 
> > index 87de2dc..a7da227 100644 
> > --- a/docs/misc/xen-command-line.markdown 
> > +++ b/docs/misc/xen-command-line.markdown 
> > @@ -197,6 +197,16 @@ Scrub free RAM during boot.  This is a safety feature to prevent 
> >  accidentally leaking sensitive VM data into other VMs if Xen crashes 
> >  and reboots. 
> >  
> > +### bootscrub_chunk_ 
>
> Erronious trailing underscore, and the first one needs escaping for the 
> markdown to end up properly formatted. 
>
> > +> `= <size>` 
> > + 
> > +> Default: `128MiB` 
>
> `128MB` 
>
> 128MiB will be rejected by the size parsing code

Yes. Will correct.
>
> > + 
> > +Maximum RAM block size chunks to be scrubbed whilst holding the page heap lock 
> > +and not running softirqs. Reduce this if softirqs are not being run frequently 
> > +enough. Setting this to a high value may cause cause boot failure, particularly 
> > +if the NMI watchdog is also enabled. 
> > + 
> >  ### cachesize 
> >  > `= <size>` 
> >  
> > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c 
> > index 601319c..3ad6e1d 100644 
> > --- a/xen/common/page_alloc.c 
> > +++ b/xen/common/page_alloc.c 
> > @@ -65,6 +65,12 @@ static bool_t opt_bootscrub __initdata = 1; 
> >  boolean_param("bootscrub", opt_bootscrub); 
> >  
> >  /* 
> > + * bootscrub_blocksize -> Size (bytes) of mem block to scrub with heaplock held 
>
> Stale comment?

Aye
>
> > + */ 
> > +static unsigned int __initdata opt_bootscrub_chunk = 128 * 1024 * 1024; 
> > +size_param("bootscrub_chunk", opt_bootscrub_chunk); 
> > + 
> > +/* 
> >   * Bit width of the DMA heap -- used to override NUMA-node-first. 
> >   * allocation strategy, which can otherwise exhaust low memory. 
> >   */ 
> > @@ -90,6 +96,16 @@ static struct bootmem_region { 
> >  } *__initdata bootmem_region_list; 
> >  static unsigned int __initdata nr_bootmem_regions; 
> >  
> > +struct scrub_region { 
> > +    unsigned long offset; 
> > +    unsigned long start; 
> > +    unsigned long per_cpu_sz; 
> > +    unsigned long rem; 
> > +    cpumask_t cpu; 
>
> cpus surely? 
>
> > +}; 
> > +static struct scrub_region __initdata region[MAX_NUMNODES]; 
> > +static unsigned long __initdata chunk_size; 
> > + 
> >  static void __init boot_bug(int line) 
> >  { 
> >      panic("Boot BUG at %s:%d", __FILE__, line); 
> > @@ -1256,45 +1272,166 @@ void __init end_boot_allocator(void) 
> >      printk("\n"); 
> >  } 
> >  
> > +void __init smp_scrub_heap_pages(void *data) 
> > +{ 
> > +    unsigned long mfn, start, end; 
> > +    struct page_info *pg; 
> > +    struct scrub_region *r; 
> > +    unsigned int temp_cpu, node, cpu_idx = 0; 
> > +    unsigned int cpu = smp_processor_id(); 
> > + 
> > +    if ( data ) 
> > +        r = data; 
> > +    else { 
> > +        node = cpu_to_node(cpu); 
> > +        if ( node == NUMA_NO_NODE ) 
> > +            return; 
> > +        r = &region[node]; 
> > +    } 
> > +    ASSERT(r != NULL); 
>
> Under what conditions would NULL be passed?  Can't the caller do 
> something more sane? 
>

Left over code from v2.
> > + 
> > +    /* Determine the current CPU's index into CPU's linked to this node*/ 
>
> Space at the end of the comment 
>
> > +    for_each_cpu ( temp_cpu, &r->cpu ) 
> > +    { 
> > +        if ( cpu == temp_cpu ) 
> > +            break; 
> > +        cpu_idx++; 
> > +    } 
>
> Is this really the best way to do this?  perhaps, but it absolutely 
> should be calculated once on the first chunk scrubbed rather than for 
> each chunk.  Or better yet, probably calculated by the BSP when it is 
> splitting up RAM.

Where would you store this? The region is per-node. We would need to allocate an CPUs array to store the cpu-idx values. I am not sure if it is worth that considering this code only runs at boot up.
>
> > + 
> > +    /* Calculate the starting mfn for this CPU's memory block */ 
> > +    start = r->start + (r->per_cpu_sz * cpu_idx) + r->offset; 
> > + 
> > +    /* Calculate the end mfn into this CPU's memory block for this iteration */ 
> > +    if ( r->offset + chunk_size > r->per_cpu_sz ) { 
> > +        end = r->start + (r->per_cpu_sz * cpu_idx) + r->per_cpu_sz; 
> > +        if ( r->rem && ((cpumask_weight(&r->cpu) - 1 == cpu_idx )) ) 
> > +            end += r->rem; 
> > +    } 
> > +    else 
> > +        end = start + chunk_size; 
> > + 
> > +    for ( mfn = start; mfn < end; mfn++ ) 
> > +    { 
> > +        pg = mfn_to_page(mfn); 
> > + 
> > +        /* Check the mfn is valid and page is free. */ 
> > +        if ( !mfn_valid(mfn) || !page_state_is(pg, free) ) 
> > +            continue; 
> > + 
> > +        scrub_one_page(pg); 
> > +    } 
> > +    wmb(); 
>
> Why this barrier?

Tim asked for it in his review - see http://lists.xen.org/archives/html/xen-devel/2013-10/msg00131.html
" + scrub_one_page(pg); > + } There should be a wmb() here, to make sure the main scrub dispatcher can't exit while the last worker is still issuing writes."

But with the @wait = 1 this fix is not needed anymore. Will remove it.

>
> > +} 
> > + 
> >  /* 
> > - * Scrub all unallocated pages in all heap zones. This function is more 
> > - * convoluted than appears necessary because we do not want to continuously 
> > - * hold the lock while scrubbing very large memory areas. 
> > + * Scrub all unallocated pages in all heap zones. This function uses all 
> > + * online cpu's to scrub the memory in parallel. 
> >   */ 
> >  void __init scrub_heap_pages(void) 
> >  { 
> > -    unsigned long mfn; 
> > -    struct page_info *pg; 
> > +    cpumask_t node_cpus, temp_cpus, all_worker_cpus = {{ 0 }}; 
> > +    unsigned int i, j, cpu, sibling; 
> > +    unsigned long offset, max_per_cpu_sz = 0; 
> > +    unsigned long start, end; 
> > +    unsigned long rem = 0; 
> >  
> >      if ( !opt_bootscrub ) 
> >          return; 
> >  
> > -    printk("Scrubbing Free RAM: "); 
> > +    /* Scrub block size */ 
> > +    chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT; 
> > +    if ( chunk_size == 0 ) 
> > +        chunk_size = 1; 
> >  
> > -    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ ) 
> > +    /* Round #0 - figure out amounts and which CPUs to use */ 
> > +    for_each_online_node ( i ) 
> >      { 
> > +        /* Calculate Node memory start and end address */ 
> > +        start = max(node_start_pfn(i), first_valid_mfn); 
> > +        end = min(node_start_pfn(i) + node_spanned_pages(i), max_page); 
> > +        /* CPUs that are online and on this node (if none, that it is OK */ 
> > +        cpumask_and(&node_cpus, &node_to_cpumask(i), &cpu_online_map); 
> > +        cpumask_copy(&temp_cpus, &node_cpus); 
> > +        /* Rip out threads. */ 
> > +        for_each_cpu ( j, &temp_cpus ) 
> > +        { 
> > +            cpu = 0; 
> > +            for_each_cpu(sibling, per_cpu(cpu_sibling_mask, j)) { 
> > +                if (cpu++ == 0) /* Skip 1st CPU - the core */ 
> > +                    continue; 
> > +                cpumask_clear_cpu(sibling, &node_cpus); 
> > +            } 
> > +        } 
> > +        cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus); 
> > +        if ( cpumask_empty(&node_cpus) ) { /* No CPUs on this node. */ 
> > +            rem = 0; 
> > +            region[i].per_cpu_sz = (end - start); 
> > +        } else { 
> > +            rem = (end - start) % cpumask_weight(&node_cpus); 
> > +            region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus); 
> > +            if ( region[i].per_cpu_sz > max_per_cpu_sz ) 
> > +                max_per_cpu_sz = region[i].per_cpu_sz; 
> > +        } 
> > +        region[i].start = start; 
> > +        region[i].rem = rem; 
> > +        cpumask_copy(&region[i].cpu, &node_cpus); 
> > + 
> > +    } 
> > +    cpu = smp_processor_id(); 
>
> This is surely always 0?, and I don't see it being used any further down.... 
>

Leftover from v2.

Thanks for looking!
> ~Andrew 
>
> > +    /* Round default chunk size down if required */ 
> > +    if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz ) 
> > +        chunk_size = max_per_cpu_sz; 
> > + 
> > +    printk("Scrubbing Free RAM on %u nodes using %u CPUs: ", num_online_nodes(), 
> > +           cpumask_weight(&all_worker_cpus)); 
> > + 
> > +    /* Round: #1 - do NUMA nodes with CPUs */ 
> > +    for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size ) 
> > +    { 
> > +        for_each_online_node ( i ) 
> > +            region[i].offset = offset; 
> > + 
> >          process_pending_softirqs(); 
> >  
> > -        pg = mfn_to_page(mfn); 
> > +        spin_lock(&heap_lock); 
> > +        on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, NULL, 1); 
> > +        spin_unlock(&heap_lock); 
> >  
> > -        /* Quick lock-free check. */ 
> > -        if ( !mfn_valid(mfn) || !page_state_is(pg, free) ) 
> > +        printk("."); 
> > +    } 
> > + 
> > +    /* Round #2: NUMA nodes with no CPUs get scrubbed with all CPUs. */ 
> > +    for_each_online_node ( i ) 
> > +    { 
> > +        node_cpus = node_to_cpumask(i); 
> > + 
> > +        if ( !cpumask_empty(&node_cpus) ) 
> >              continue; 
> >  
> > -        /* Every 100MB, print a progress dot. */ 
> > -        if ( (mfn % ((100*1024*1024)/PAGE_SIZE)) == 0 ) 
> > -            printk("."); 
> > +        /* We already have the node information from round #0 */ 
> > +        end = region[i].start + region[i].per_cpu_sz; 
> > +        rem = region[i].per_cpu_sz % cpumask_weight(&all_worker_cpus); 
> >  
> > -        spin_lock(&heap_lock); 
> > +        region[i].rem = rem; 
> > +        region[i].per_cpu_sz /= cpumask_weight(&all_worker_cpus); 
> > +        max_per_cpu_sz = region[i].per_cpu_sz; 
> > +        if ( max_per_cpu_sz && chunk_size > max_per_cpu_sz ) 
> > +            chunk_size = max_per_cpu_sz; 
> > +        cpumask_copy(&region[i].cpu, &all_worker_cpus); 
> >  
> > -        /* Re-check page status with lock held. */ 
> > -        if ( page_state_is(pg, free) ) 
> > -            scrub_one_page(pg); 
> > +        for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size ) 
> > +        { 
> > +            region[i].offset = offset; 
> >  
> > -        spin_unlock(&heap_lock); 
> > -    } 
> > +            process_pending_softirqs(); 
> > + 
> > +            spin_lock(&heap_lock); 
> > +            on_selected_cpus(&all_worker_cpus, smp_scrub_heap_pages, &region[i], 1); 
> > +            spin_unlock(&heap_lock); 
> >  
> > -    printk("done.\n"); 
> > +            printk("."); 
> > +        } 
> > +    } 
> >  
> >      /* Now that the heap is initialized, run checks and set bounds 
> >       * for the low mem virq algorithm. */ 
>
>
> _______________________________________________ 
> Xen-devel mailing list 
> Xen-devel@lists.xen.org 
> http://lists.xen.org/xen-devel 
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-04-14 15:54 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-04-11 18:08 [PATCH v3] Xen: Spread boot time page scrubbing across all available CPUs Konrad Rzeszutek Wilk
2014-04-11 18:08 ` [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's Konrad Rzeszutek Wilk
2014-04-11 18:38   ` Andrew Cooper
2014-04-11 19:19   ` Julien Grall
2014-04-12 13:19     ` Konrad Rzeszutek Wilk
2014-04-13 22:38   ` Tim Deegan
2014-04-14  9:39   ` Jan Beulich
2014-04-14 15:04     ` Konrad Rzeszutek Wilk
2014-04-14 15:18       ` NUMA nodes with no cpus Andrew Cooper
2014-04-14 15:38       ` [PATCH v3] Xen: Spread boot time page scrubbing across all available CPU's Jan Beulich
2014-04-14 15:54         ` Konrad Rzeszutek Wilk
  -- strict thread matches above, loose matches on Subject: below --
2014-04-12 10:41 Konrad Rzeszutek Wilk
2014-04-13 22:45 ` Tim Deegan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.