Re: [PATCH v2] Xen: Spread boot time page scrubbing across all available CPU's

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Malcolm Crossley <malcolm.crossley@citrix.com>
Cc: keir@xen.org, andrew.cooper3@citrix.com, tim@xen.org,
	JBeulich@suse.com, xen-devel@lists.xen.org
Subject: Re: [PATCH v2] Xen: Spread boot time page scrubbing across all available CPU's
Date: Mon, 30 Sep 2013 13:43:33 -0400
Message-ID: <20130930174333.GA3106@phenom.dumpdata.com>
In-Reply-To: <ee1108d26fc5f8c2b44a.1380544517@malcolmc.uk.xensource.com>

On Mon, Sep 30, 2013 at 01:35:17PM +0100, Malcolm Crossley wrote:
> The page scrubbing is done in 128MB chunks in lockstep across all the CPUs.
> This allows the boot CPU to hold the heap_lock whilst each chunk is being
> scrubbed and then release the heap_lock when all CPUs have finished scrubbing
> their individual chunk. As a result, the heap_lock is not held continuously
> and pending softirqs are serviced periodically across all CPUs.
> 
> The page scrub memory chunks are allocated to the CPUs in a NUMA-aware
> fashion to reduce socket interconnect overhead and improve performance.
> 
> This patch reduces the boot page scrub time on a 128GB 64 core AMD Opteron
> 6386 machine from 49 seconds to 3 seconds.

A slightly older version of this patch cut the scrubbing time on a 1TB machine
from minutes (I think it was 5 or 10; I gave up counting) to less than a minute.
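
For anyone reading along, the per-CPU range arithmetic that smp_scrub_heap_pages()
performs below can be sketched as a standalone function. This is only an
illustration under stated assumptions, not the patch's code: the names
scrub_region_sketch, mfn_range and scrub_range are hypothetical, and the final
clamp to an empty range is my addition (in the patch the for loop simply does
not execute when end_mfn is below start_mfn).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's struct scrub_region (units: page frames). */
struct scrub_region_sketch {
    uint64_t start;          /* first frame of the node's memory */
    uint64_t cpu_block_size; /* frames assigned to each CPU of the node */
    uint64_t offset;         /* how far the lockstep loop has advanced into each block */
    uint64_t chunk_size;     /* frames scrubbed per lockstep iteration */
};

/* Half-open frame range [start, end). */
struct mfn_range {
    uint64_t start;
    uint64_t end;
};

/* Range one CPU scrubs in the current iteration; cpu_index is its position within the node. */
static struct mfn_range scrub_range(const struct scrub_region_sketch *r,
                                    unsigned int cpu_index)
{
    struct mfn_range range;
    uint64_t block_start;

    assert(r != NULL);
    block_start = r->start + r->cpu_block_size * cpu_index;

    range.start = block_start + r->offset;
    if (r->offset + r->chunk_size > r->cpu_block_size)
        range.end = block_start + r->cpu_block_size; /* last, partial chunk */
    else
        range.end = range.start + r->chunk_size;

    /* Assumption: report an empty range once this CPU's block is exhausted. */
    if (range.end < range.start)
        range.end = range.start;

    return range;
}
```

With start = 1000, a per-CPU block of 100 frames and 32-frame chunks, the CPU at
index 2 gets [1200, 1232) on the first iteration and [1296, 1300) on the last.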

> 
> Changes in v2
>  - Reduced default chunk size to 128MB
>  - Added code to scrub NUMA nodes with no active CPU linked to them
>  - Be robust to boot CPU not being linked to a NUMA node
> 
> diff -r a03cc3136759 -r ee1108d26fc5 docs/misc/xen-command-line.markdown
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -188,6 +188,16 @@ Scrub free RAM during boot.  This is a s
>  accidentally leaking sensitive VM data into other VMs if Xen crashes
>  and reboots.
>  
> +### bootscrub_blocksize
> +> `= <size>`
> +
> +> Default: `128MiB`
> +
> +Maximum RAM block size to be scrubbed whilst holding the page heap lock and not
> +running softirqs. Reduce this if softirqs are not being run frequently enough.
> +Setting this to a high value may cause boot failure, particularly if the
> +NMI watchdog is also enabled.
> +
>  ### cachesize
>  > `= <size>`
>  
> diff -r a03cc3136759 -r ee1108d26fc5 xen/common/page_alloc.c
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -65,6 +65,12 @@ static bool_t opt_bootscrub __initdata =
>  boolean_param("bootscrub", opt_bootscrub);
>  
>  /*
> + * bootscrub_blocksize -> Size (bytes) of mem block to scrub with heaplock held
> + */
> +static unsigned int __initdata opt_bootscrub_blocksize = 128 * 1024 * 1024;
> +size_param("bootscrub_blocksize", opt_bootscrub_blocksize);
> +
> +/*
>   * Bit width of the DMA heap -- used to override NUMA-node-first.
>   * allocation strategy, which can otherwise exhaust low memory.
>   */
> @@ -90,6 +96,16 @@ static struct bootmem_region {
>  } *__initdata bootmem_region_list;
>  static unsigned int __initdata nr_bootmem_regions;
>  
> +static atomic_t __initdata bootscrub_count = ATOMIC_INIT(0);
> +
> +struct scrub_region {
> +    u64 offset;
> +    u64 start;
> +    u64 chunk_size;
> +    u64 cpu_block_size;
> +};
> +static struct scrub_region __initdata region[MAX_NUMNODES];
> +
>  static void __init boot_bug(int line)
>  {
>      panic("Boot BUG at %s:%d\n", __FILE__, line);
> @@ -1254,28 +1270,44 @@ void __init end_boot_allocator(void)
>      printk("\n");
>  }
>  
> -/*
> - * Scrub all unallocated pages in all heap zones. This function is more
> - * convoluted than appears necessary because we do not want to continuously
> - * hold the lock while scrubbing very large memory areas.
> - */
> -void __init scrub_heap_pages(void)
> +void __init smp_scrub_heap_pages(void *data)
>  {
> -    unsigned long mfn;
> +    unsigned long mfn, start_mfn, end_mfn;
>      struct page_info *pg;
> +    struct scrub_region *region = data;
> +    unsigned int temp_cpu, local_node, local_cpu_index = 0;
> +    unsigned int cpu = smp_processor_id();
>  
> -    if ( !opt_bootscrub )
> -        return;
> +    ASSERT(region != NULL);
>  
> -    printk("Scrubbing Free RAM: ");
> +    local_node = cpu_to_node(cpu);
> +    /* Determine if we are scrubbing using the boot CPU */
> +    if ( region->cpu_block_size != ~0ULL )
> +        /* Determine the current CPU's index among the CPUs linked to this node */
> +        for_each_cpu( temp_cpu, &node_to_cpumask(local_node) )
> +        {
> +            if ( cpu == temp_cpu )
> +                break;
> +            local_cpu_index++;
> +        }
>  
> -    for ( mfn = first_valid_mfn; mfn < max_page; mfn++ )
> +    /* Calculate the starting mfn for this CPU's memory block */
> +    start_mfn = region->start + (region->cpu_block_size * local_cpu_index)
> +                + region->offset;
> +
> +    /* Calculate the end mfn into this CPU's memory block for this iteration */
> +    if ( region->offset + region->chunk_size > region->cpu_block_size )
> +        end_mfn = region->start + (region->cpu_block_size * local_cpu_index)
> +                  + region->cpu_block_size;
> +    else
> +        end_mfn = start_mfn + region->chunk_size;
> +
> +
> +    for ( mfn = start_mfn; mfn < end_mfn; mfn++ )
>      {
> -        process_pending_softirqs();
> -
>          pg = mfn_to_page(mfn);
>  
> -        /* Quick lock-free check. */
> +        /* Check the mfn is valid and page is free. */
>          if ( !mfn_valid(mfn) || !page_state_is(pg, free) )
>              continue;
>  
> @@ -1283,15 +1315,124 @@ void __init scrub_heap_pages(void)
>          if ( (mfn % ((100*1024*1024)/PAGE_SIZE)) == 0 )
>              printk(".");
>  
> +        /* Do the scrub if possible */
> +        if ( page_state_is(pg, free) )
> +            scrub_one_page(pg);
> +    }
> +    /* Decrement count to indicate scrubbing complete on this CPU */
> +    atomic_dec(&bootscrub_count);
> +}
> +
> +/*
> + * Scrub all unallocated pages in all heap zones. This function uses all
> + * online CPUs to scrub the memory in parallel.
> + */
> +void __init scrub_heap_pages(void)
> +{
> +    cpumask_t node_cpus, total_node_cpus_mask = {{ 0 }};
> +    unsigned int i, boot_cpu_node, total_node_cpus, cpu = smp_processor_id();
> +    unsigned long mfn, mfn_off, chunk_size, max_cpu_blk_size = 0;
> +    unsigned long mem_start, mem_end;
> +
> +    if ( !opt_bootscrub )
> +        return;
> +
> +    boot_cpu_node = cpu_to_node(cpu);
> +
> +    printk("Scrubbing Free RAM: ");
> +
> +    /* Scrub block size */
> +    chunk_size = opt_bootscrub_blocksize >> PAGE_SHIFT;
> +    if ( chunk_size == 0 )
> +        chunk_size = 1;
> +
> +    /* Determine the amount of memory to scrub, per CPU on each Node */
> +    for_each_online_node ( i )
> +    {
> +        /* Calculate Node memory start and end address */
> +        mem_start = max(node_start_pfn(i), first_valid_mfn);
> +        mem_end = min(mem_start + node_spanned_pages(i), max_page);
> +        /* Divide by number of CPUs for this node */
> +        node_cpus = node_to_cpumask(i);
> +        /* It's possible a node has no CPUs */
> +        if ( cpumask_empty(&node_cpus) )
> +            continue;
> +        cpumask_or(&total_node_cpus_mask, &total_node_cpus_mask, &node_cpus);
> +
> +        region[i].cpu_block_size = (mem_end - mem_start) /
> +                                    cpumask_weight(&node_cpus);
> +        region[i].start = mem_start;
> +
> +        if ( region[i].cpu_block_size > max_cpu_blk_size )
> +            max_cpu_blk_size = region[i].cpu_block_size;
> +    }
> +
> +    /* Round default chunk size down if required */
> +    if ( max_cpu_blk_size && chunk_size > max_cpu_blk_size )
> +        chunk_size = max_cpu_blk_size;
> +
> +    total_node_cpus = cpumask_weight(&total_node_cpus_mask);
> +    /* Start all CPUs scrubbing memory, chunk_size at a time */
> +    for ( mfn_off = 0; mfn_off < max_cpu_blk_size; mfn_off += chunk_size )
> +    {
> +        process_pending_softirqs();
> +
> +        atomic_set(&bootscrub_count, total_node_cpus);
> +
>          spin_lock(&heap_lock);
>  
> -        /* Re-check page status with lock held. */
> -        if ( page_state_is(pg, free) )
> -            scrub_one_page(pg);
> +        /* Start all other CPUs on all nodes */
> +        for_each_online_node ( i )
> +        {
> +            region[i].chunk_size = chunk_size;
> +            region[i].offset = mfn_off;
> +            node_cpus = node_to_cpumask(i);
> +            /* Clear local cpu ID */
> +            cpumask_clear_cpu(cpu, &node_cpus);
> +            /* Start page scrubbing on all other CPUs */
> +            on_selected_cpus(&node_cpus, smp_scrub_heap_pages, &region[i], 0);
> +        }
> +
> +        /* Start scrub on local CPU if CPU linked to a memory node */
> +        if ( boot_cpu_node != NUMA_NO_NODE )
> +            smp_scrub_heap_pages(&region[boot_cpu_node]);
> +
> +        /* Wait for page scrubbing to complete on all other CPUs */
> +        while ( atomic_read(&bootscrub_count) > 0 )
> +            cpu_relax();
>  
>          spin_unlock(&heap_lock);
>      }
>  
> +    /* Use the boot CPU to scrub any nodes which have no CPUs linked to them */
> +    for_each_online_node ( i )
> +    {
> +        node_cpus = node_to_cpumask(i);
> +
> +        if ( !cpumask_empty(&node_cpus) )
> +            continue;
> +
> +        mem_start = max(node_start_pfn(i), first_valid_mfn);
> +        mem_end = min(mem_start + node_spanned_pages(i), max_page);
> +
> +        region[0].offset = 0;
> +        region[0].cpu_block_size = ~0ULL;
> +
> +        for ( mfn = mem_start; mfn < mem_end; mfn += chunk_size )
> +        {
> +            spin_lock(&heap_lock);
> +            if ( mfn + chunk_size > mem_end )
> +                region[0].chunk_size = mem_end - mfn;
> +            else
> +                region[0].chunk_size = chunk_size;
> +
> +            region[0].start = mfn;
> +
> +            smp_scrub_heap_pages(&region[0]);
> +            spin_unlock(&heap_lock);
> +            process_pending_softirqs();
> +        }
> +    }
>      printk("done.\n");
>  
>      /* Now that the heap is initialized, run checks and set bounds
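
To make the effect of bootscrub_blocksize concrete, here is a rough standalone
sketch of how scrub_heap_pages() above derives the chunk size in pages and how
many lockstep iterations (and so heap_lock acquisitions and softirq servicing
points) follow from it. The helper names chunk_pages and lockstep_iterations
are hypothetical, and SKETCH_PAGE_SHIFT assumes 4KiB pages.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4KiB pages */

/*
 * Convert the block size in bytes to pages, as the patch does: at least one
 * page, and no larger than the biggest per-CPU block (if one is known).
 */
static uint64_t chunk_pages(uint64_t blocksize_bytes, uint64_t max_cpu_blk_pages)
{
    uint64_t chunk = blocksize_bytes >> SKETCH_PAGE_SHIFT;

    if (chunk == 0)
        chunk = 1;
    if (max_cpu_blk_pages && chunk > max_cpu_blk_pages)
        chunk = max_cpu_blk_pages;

    return chunk;
}

/*
 * Number of passes made by a loop of the form
 *   for ( off = 0; off < max_cpu_blk_pages; off += chunk )
 * which is the ceiling of max_cpu_blk_pages / chunk.
 */
static uint64_t lockstep_iterations(uint64_t max_cpu_blk_pages, uint64_t chunk)
{
    assert(chunk > 0);
    return (max_cpu_blk_pages + chunk - 1) / chunk;
}
```

With 4KiB pages the 128MiB default comes to 32768 pages per chunk. If the
largest per-CPU block were 2GiB (a made-up figure that ignores memory holes and
uneven nodes), the loop would take the heap_lock 16 times.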

Thread overview: 14+ messages
2013-09-30 12:35 [PATCH v2] Xen: Spread boot time page scrubbing across all available CPU's Malcolm Crossley
2013-09-30 13:26 ` Jan Beulich
2013-09-30 13:56   ` Malcolm Crossley
2013-09-30 15:35     ` Jan Beulich
2013-09-30 15:42       ` Andrew Cooper
2013-09-30 16:08         ` Jan Beulich
2013-09-30 17:43 ` Konrad Rzeszutek Wilk [this message]
2013-10-03 11:39 ` Tim Deegan
2014-04-01 19:29 ` Konrad Rzeszutek Wilk
2014-04-03  1:19   ` Konrad Rzeszutek Wilk
2014-04-03  8:35     ` Jan Beulich
2014-04-03  9:00     ` Tim Deegan
2014-04-09 14:20       ` Konrad Rzeszutek Wilk
2014-04-10 11:09         ` Dario Faggioli
