From: "Jan Beulich" <JBeulich@suse.com>
To: Bob Liu <lliubbo@gmail.com>
Cc: keir@xen.org, ian.campbell@citrix.com,
	George.Dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	xen-devel@lists.xenproject.org
Subject: Re: [PATCH v2 3/3] xen: use idle vcpus to scrub pages
Date: Tue, 01 Jul 2014 10:12:28 +0100	[thread overview]
Message-ID: <53B2979C020000780001EE97@mail.emea.novell.com> (raw)
In-Reply-To: <1404135584-29206-3-git-send-email-bob.liu@oracle.com>

>>> On 30.06.14 at 15:39, <lliubbo@gmail.com> wrote:
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -116,6 +116,7 @@ static void idle_loop(void)
>      {
>          if ( cpu_is_offline(smp_processor_id()) )
>              play_dead();
> +        scrub_free_pages();
>          (*pm_idle)();

So is it really appropriate to call pm_idle() if scrub_free_pages() didn't
return immediately? I.e. I would suppose the function ought to return
a boolean indicator of whether any meaningful amount of processing
was done, in which case we may want to skip entering any kind of C
state (even if it would end up being just C1), as well as skip any of the
processing associated with possibly entering one.
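
Concretely, the shape I have in mind is something like this (a stand-alone
toy model, not Xen code - all names here are stand-ins for the real
idle_loop() pieces):

```c
#include <assert.h>
#include <stdbool.h>

static int pm_idle_calls;

/* Stand-in for the real scrubber: report whether meaningful work was done. */
static bool scrub_free_pages_stub(bool work_available)
{
    return work_available;
}

/* Stand-in for (*pm_idle)(). */
static void pm_idle_stub(void)
{
    ++pm_idle_calls;
}

/* One iteration of the idle loop: only enter a C state (pm_idle) when
 * the scrub step reports that it did no meaningful work. */
static void idle_iteration(bool work_available)
{
    if ( !scrub_free_pages_stub(work_available) )
        pm_idle_stub();
}
```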

> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index ab293c8..6ab1d1d 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -86,6 +86,12 @@ PAGE_LIST_HEAD(page_offlined_list);
>  /* Broken page list, protected by heap_lock. */
>  PAGE_LIST_HEAD(page_broken_list);
>  
> +/* A rough flag to indicate whether a node have need_scrub pages */
> +static bool_t node_need_scrub[MAX_NUMNODES];
> +static DEFINE_PER_CPU(bool_t, is_scrubbing);
> +static DEFINE_PER_CPU(struct page_list_head, scrub_list_cpu);
> +static DEFINE_PER_CPU(struct page_list_head, free_list_cpu);
> +
>  /*************************
>   * BOOT-TIME ALLOCATOR
>   */
> @@ -948,6 +954,7 @@ static void free_heap_pages(
>      {
>          if ( !tainted )
>          {
> +            node_need_scrub[node] = 1;
>              for ( i = 0; i < (1 << order); i++ )
>                  pg[i].count_info |= PGC_need_scrub;
>          }

Iirc it was more than this single place where you set
PGC_need_scrub, and hence where you'd now need to set the
other flag too.

> @@ -1525,7 +1532,130 @@ void __init scrub_heap_pages(void)
>      setup_low_mem_virq();
>  }
>  
> +#define SCRUB_BATCH_ORDER 12

Please make this adjustable via command line, so that eventual
latency issues can be worked around.
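
I.e. something along these lines (untested sketch; the parameter name
is of course up to you):

```c
static unsigned int __read_mostly scrub_batch_order = 12;
integer_param("scrub-batch-order", scrub_batch_order);
```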

> +static void __scrub_free_pages(unsigned int node, unsigned int cpu)
> +{
> +    struct page_info *pg, *tmp;
> +    unsigned int i;
> +    int order;
> +    struct page_list_head *local_scrub_list = &this_cpu(scrub_list_cpu);
> +    struct page_list_head *local_free_list = &this_cpu(free_list_cpu);
> +
> +    /* Scrub percpu list */
> +    while ( !page_list_empty(local_scrub_list) )
> +    {
> +        pg = page_list_remove_head(local_scrub_list);
> +        order = PFN_ORDER(pg);
> +        ASSERT( pg && order <= SCRUB_BATCH_ORDER );
> +        for ( i = 0; i < (1 << order); i++ )
> +        {
> +            ASSERT( test_bit(_PGC_need_scrub, &pg[i].count_info) );
> +            scrub_one_page(&pg[i]);
> +        }
> +        page_list_add_tail(pg, local_free_list);
> +        if ( softirq_pending(cpu) )
> +		return;

Hard tabs. Please, also with further violations below in mind, check
your code before submitting.

> +    }
> +
> +    /* free percpu free list */
> +    if ( !page_list_empty(local_free_list) )
> +    {
> +        spin_lock(&heap_lock);
> +        page_list_for_each_safe( pg, tmp, local_free_list )
> +        {
> +            order = PFN_ORDER(pg);
> +            page_list_del(pg, local_free_list);
> +            for ( i = 0; i < (1 << order); i++ )
> +	    {
> +                pg[i].count_info |= PGC_state_free;
> +                pg[i].count_info &= ~PGC_need_scrub;

This needs to happen earlier - the scrub flag should be cleared right
after scrubbing, and the free flag should imo be set when the page
gets freed. That's for two reasons:
1) Hypervisor allocations don't need scrubbed pages, i.e. they can
allocate memory regardless of the scrub flag's state.
2) You still withhold the memory on the local lists from allocation. On a
many-node system, the 16Mb per node can certainly add up (which
is not to say that I don't view the 16Mb on a single node as already
problematic).
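
To illustrate the ordering I'm suggesting (a toy model - the flag values
and the struct here are made up, not the real count_info layout):

```c
#include <assert.h>

/* Made-up flag values for illustration only. */
#define PGC_need_scrub  (1UL << 0)
#define PGC_state_free  (1UL << 1)

struct toy_page {
    unsigned long count_info;
};

/* Clear the need-scrub bit right after scrubbing, so a clean page is
 * immediately usable for hypervisor allocations. */
static void toy_scrub_one_page(struct toy_page *pg)
{
    /* ...zero the page contents here... */
    pg->count_info &= ~PGC_need_scrub;
}

/* Set the free bit at the point the page actually gets freed. */
static void toy_free_page(struct toy_page *pg)
{
    pg->count_info |= PGC_state_free;
}
```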

> +            }
> +            merge_free_trunks(pg, order, node, page_to_zone(pg), 0);
> +        }
> +        spin_unlock(&heap_lock);
> +    }
> +}
> +
> +void scrub_free_pages(void)
> +{
> +    int order;
> +    struct page_info *pg, *tmp;
> +    unsigned int i, zone, nr_delisted = 0;
> +    unsigned int cpu = smp_processor_id();
> +    unsigned int node = cpu_to_node(cpu);
> +    struct page_list_head *local_scrub_list = &this_cpu(scrub_list_cpu);
> +
> +    /* Return if our sibling already started scrubbing */
> +    for_each_cpu( i, per_cpu(cpu_sibling_mask,cpu) )
> +        if ( per_cpu(is_scrubbing, i) )
> +            return;
> +    this_cpu(is_scrubbing) = 1;

If you really want to enforce exclusiveness, you need a single
canonical flag per core (could e.g. be
per_cpu(is_scrubbing, cpumask_first(per_cpu(cpu_sibling_mask, cpu))),
set via test_and_set_bool()).
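
For illustration, a toy model of the per-core canonical flag (topology
made up: two threads per core; cpu_first_sibling() stands in for
cpumask_first(per_cpu(cpu_sibling_mask, cpu)), and the plain
read-then-set stands in for test_and_set_bool()):

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4
static bool is_scrubbing[TOY_NR_CPUS];

/* Map every thread of a core to the core's first sibling. */
static unsigned int cpu_first_sibling(unsigned int cpu)
{
    return cpu & ~1u;
}

/* Only one thread per core wins the right to scrub. */
static bool try_start_scrubbing(unsigned int cpu)
{
    bool *flag = &is_scrubbing[cpu_first_sibling(cpu)];

    if ( *flag )          /* real code: if ( test_and_set_bool(*flag) ) */
        return false;
    *flag = true;
    return true;
}
```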

> +
> +    while ( !softirq_pending(cpu) )
> +    {
> +        if ( !node_need_scrub[node] )
> +        {
> +            /* Free local per cpu list before we exit */
> +            __scrub_free_pages(node, cpu);
> +            goto out;
> +        }

This seems unnecessary: Just have the if() below be

        if ( node_need_scrub[node] && page_list_empty(local_scrub_list) )

along with __scrub_free_pages() returning whether it exited because
of softirq_pending(cpu) (which at once eliminates the need for the
check at the loop header above, allowing this to become a nice
do { ... } while ( !__scrub_free_pages() ).
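
A toy model of that loop shape (stand-alone, with counters standing in
for node_need_scrub, the heap, and the per-cpu list):

```c
#include <assert.h>
#include <stdbool.h>

static int batches_left;   /* stand-in for node_need_scrub + heap state */
static int local_list;     /* stand-in for the per-cpu scrub list */
static int scrubbed;

/* Models __scrub_free_pages(): scrub the local list, return whether we
 * bailed out because of a pending softirq (never, in this toy model). */
static bool scrub_step(void)
{
    if ( local_list )
    {
        --local_list;
        ++scrubbed;
    }
    return false;
}

/* Models the suggested do { ... } while ( !__scrub_free_pages() ):
 * refill only when work remains and the local list ran dry. */
static void scrub_loop(void)
{
    do {
        if ( batches_left && !local_list )
        {
            --batches_left;   /* delist one batch into the local list */
            local_list = 1;
        }
        else if ( !batches_left && !local_list )
            break;            /* nothing left to do */
    } while ( !scrub_step() );
}
```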

> +
> +        /* Delist a batch of pages from global scrub list */
> +        if ( page_list_empty(local_scrub_list) )
> +        {
> +            spin_lock(&heap_lock);
> +            for ( zone = 0; zone < NR_ZONES; zone++ )
> +            {
> +                for ( order = MAX_ORDER; order >= 0; order-- )
> +                {
> +                    page_list_for_each_safe( pg, tmp, &heap(node, zone, order) )
> +                    {
> +                        if ( !test_bit(_PGC_need_scrub, &(pg->count_info)) )

Please avoid the inner parentheses here.

> +                            continue;
> +
> +                        page_list_del( pg, &heap(node, zone, order) );
> +                        if ( order > SCRUB_BATCH_ORDER)

Coding style.

> +                        {
> +                            /* putback extra pages */
> +                            i = order;
> +                            while ( i != SCRUB_BATCH_ORDER )

This would better be a do/while construct - on the first iteration you
already know the condition is true.

> +                            {
> +                                PFN_ORDER(pg) = --i;
> +                                page_list_add_tail(pg, &heap(node, zone, i));
> +                                pg += 1 << i;

I realize there are cases where this is also not being done correctly in
existing code, but please use 1UL here and in all similar cases.
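
Both points together, as a stand-alone toy model of the put-back loop
(the comments map each line back to the patch; only the page
arithmetic is modeled, not the list manipulation):

```c
#include <assert.h>

#define SCRUB_BATCH_ORDER 12

/* Split a 2^order chunk down to a 2^SCRUB_BATCH_ORDER head as a
 * do/while with 1UL shifts, returning the number of pages put back
 * (the head pointer advances by the same amount). */
static unsigned long toy_put_back_extra(unsigned int order)
{
    unsigned long put_back = 0;
    unsigned int i = order;

    do {
        --i;                      /* PFN_ORDER(pg) = --i;                  */
        put_back += 1UL << i;     /* page_list_add_tail(pg, &heap(..., i)) */
                                  /* pg += 1UL << i;                       */
    } while ( i != SCRUB_BATCH_ORDER );

    return put_back;
}
```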

> +                            }
> +                            PFN_ORDER(pg) = SCRUB_BATCH_ORDER;
> +                        }
> +
> +                        for ( i = 0; i < (1 << PFN_ORDER(pg)); i++ )

Can't you just use "order" here (and a few lines down)?

> +                        {
> +                            ASSERT( test_bit(_PGC_need_scrub, &pg[i].count_info) );
> +                            ASSERT( !test_bit(_PGC_broken, &pg[i].count_info) );
> +                            mark_page_offline(&pg[i], 0);

mark_page_offline() ???

Jan

> +                        }
> +                        page_list_add_tail(pg, local_scrub_list);
> +                        nr_delisted += ( 1 << PFN_ORDER(pg) );
> +                        if ( nr_delisted >= (1 << SCRUB_BATCH_ORDER) )
> +                        {
> +                            nr_delisted = 0;
> +                            spin_unlock(&heap_lock);
> +                            goto start_scrub;
> +                        }
> +                    }
> +                }
> +            }
> +
> +            node_need_scrub[node] = 0;
> +            spin_unlock(&heap_lock);
> +        }
>  
> + start_scrub:
> +        __scrub_free_pages(node, cpu);
> +    }
> +
> + out:
> +    this_cpu(is_scrubbing) = 0;
> +}

Thread overview: 25+ messages
2014-06-30 13:39 [PATCH v2 1/3] xen: delay page scrubbing to allocation path Bob Liu
2014-06-30 13:39 ` [PATCH v2 2/3] xen: introduce function merge_free_trunks Bob Liu
2014-06-30 15:58   ` Jan Beulich
2014-07-01  8:14     ` Bob Liu
2014-07-01  8:27       ` Jan Beulich
2014-06-30 13:39 ` [PATCH v2 3/3] xen: use idle vcpus to scrub pages Bob Liu
2014-07-01  9:12   ` Jan Beulich [this message]
2014-07-01 12:25     ` Bob Liu
2014-07-01 12:59       ` Jan Beulich
2014-07-02  6:27         ` Bob Liu
2014-07-07 12:20           ` Bob Liu
2014-07-15  9:16         ` Bob Liu
2014-07-23  0:38           ` Konrad Rzeszutek Wilk
2014-07-23  1:30             ` Bob Liu
2014-07-23  7:28           ` Jan Beulich
2014-07-24  2:08             ` Bob Liu
2014-07-24  6:24               ` Jan Beulich
2014-07-25  0:42                 ` Bob Liu
2014-07-25  6:51                   ` Jan Beulich
2014-07-25  7:28                     ` Bob Liu
2014-07-25  7:36                       ` Jan Beulich
2014-07-25  8:18                         ` Bob Liu
2014-07-25  8:28                           ` Jan Beulich
2014-06-30 15:56 ` [PATCH v2 1/3] xen: delay page scrubbing to allocation path Jan Beulich
2014-07-01  8:12   ` Bob Liu
