From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mats Petersson <mats.petersson@citrix.com>
Subject: Re: [RFC/PATCH] Improve speed of mapping guest memory
 into	Dom0
Date: Wed, 14 Nov 2012 14:04:44 +0000
Message-ID: <50A3A4FC.8040408@citrix.com>
References: <50A37CC7.8050700@citrix.com> <50A397E1.7000602@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <50A397E1.7000602@citrix.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
List-Id: xen-devel@lists.xenproject.org

On 14/11/12 13:08, David Vrabel wrote:
> Mats,
>
> Your patch has been white-space damaged and will not apply.  You should
> use git send-email which will do the right thing.  I've also CC'd Konrad
> who is the maintainer for the Xen parts of the kernel.
Thanks for the advice. Will use git send-email for V2 of this patch.
> On 14/11/12 11:13, Mats Petersson wrote:
>> [a long, rambling commit message?]
> The text as-is isn't really suitable for a commit message (too much
> irrelevant stuff) and there is no suitable subject line.
>
>> I have also found that the munmap() call used to unmap the guest memory
>> from Dom0 is about 35% slower in 3.7 kernel than in the 2.6 kernel (3.8M
>> cycles vs 2.8M cycles).
> This performance reduction only occurs with 32-bit guests is the Xen
> then traps-and-emulates both halves of the PTE write.
>
>> I think this could be made quicker by using a
>> direct write of zero rather than the compare exchange operation that is
>> currently used [which traps into Xen, performs the compare & exchange] -
> This is something I noticed but never got around to producing a patch.
> How about this (uncomplied!) patch?
>
> -- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1146,8 +1146,16 @@ again:
>   				     page->index > details->last_index))
>   					continue;
>   			}
> -			ptent = ptep_get_and_clear_full(mm, addr, pte,
> -							tlb->fullmm);
> +			/*
> +			 * No need for the expensive atomic get and
> +			 * clear for anonymous mappings as the dirty
> +			 * and young bits are not used.
> +			 */
> +			if (PageAnon(page))
> +				pte_clear(mm, addr, pte);
> +			else
> +				ptent = ptep_get_and_clear_full(mm, addr, pte,
> +								tlb->fullmm);
>   			tlb_remove_tlb_entry(tlb, pte, addr);
>   			if (unlikely(!page))
>   				continue;
I came up with something very similar, however found that page is 
sometimes NULL which causes problems. if (page && PageAnon(page)) will 
probably work... Trying that now.

> Now for the patch itself.  On the whole, the approach looks good and the
> real-word performance improvements are nice.  Specific comments inline.
>
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -2542,3 +2561,77 @@ out:
>>       return err;
>>   }
>>   EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_range);
>> +
>> +/* like xen_remap_domain_mfn_range, but does a list of mfn's, rather
>> + * than the, for xen, quite useless, consecutive pages.
>> + */
> /*
>   * Like xen_remap_domain_mfn_range(), but more efficiently handles MFNs
>   * that are not contiguous (which is common for a domain's memory).
>   */
Will update.
>> +int xen_remap_domain_mfn_list(struct vm_area_struct *vma,
>> +                  unsigned long addr,
>> +                  unsigned long *mfn, int nr,
>> +                  int *err_ptr,
>> +                  pgprot_t prot, unsigned domid)
> xen_remap_domain_mfn_array() ?  It's not taking a list.
Yes, better name. Although, perhaps following Ian Campbell's suggestion 
and changing the xen_remain_domain_mfn_range to "do the right thing" is 
even better. I will have a quick look at how difficult that may be - I 
feel it may be rather simpler than I first thought, as there is only one 
call to this function in privcmd.c.

The arm/xen/enlighten.c will probably also need fixing, I suppose...

>> +{
>> +    struct remap_list_data rmd;
>> +    struct mmu_update mmu_update[REMAP_BATCH_SIZE];
> This is surprisingly large (256 bytes) but I note that the existing
> xen_remap_domain_mfn_range() does the same thing so I guess it's ok.
I would actually like to increase it, but that would probably require 
allocation. One HV call per 32
>> +    int batch;
>> +    int done;
>> +    unsigned long range;
>> +    int err = 0;
>> +
>> +    if (xen_feature(XENFEAT_auto_translated_physmap))
>> +        return -EINVAL;
>> +
>> +    prot = __pgprot(pgprot_val(prot) | _PAGE_IOMAP);
>> +
>> +    BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP |
>> VM_IO)));
>> +
>> +    rmd.mfn = mfn;
>> +    rmd.prot = prot;
>> +
>> +    while (nr) {
>> +        batch = min(REMAP_BATCH_SIZE, nr);
>> +        range = (unsigned long)batch << PAGE_SHIFT;
>> +
>> +        rmd.mmu_update = mmu_update;
>> +        err = apply_to_page_range(vma->vm_mm, addr, range,
>> +                      remap_area_mfn_list_pte_fn, &rmd);
>> +        if (err)
>> +        {
>> +            printk("xen_remap_domain_mfn_list: apply_to_range:
>> err=%d\n", err);
> Stray printk?
Yes.
>> +            goto out;
>> +        }
>> +
>> +        err = HYPERVISOR_mmu_update(mmu_update, batch, &done, domid);
>> +        if (err < 0)
>> +        {
>> +            int i;
>> +            /* TODO: We should remove this printk later */
>> +            printk("xen_remap_domain_mfn_list: mmu_update: err=%d,
>> done=%d, batch=%d\n", err, done, batch);
> Yes, you should...
>
>> +            err_ptr[done] = err;
>> +
>> +            /* now do the remaining part of this batch */
>> +            for(i = done+1; i < batch; i++)
>> +            {
>> +                int tmp_err = HYPERVISOR_mmu_update(&mmu_update[i], 1,
>> NULL, domid);
>> +                if (tmp_err < 0)
>> +                {
>> +                    err_ptr[i] = tmp_err;
>> +                }
>> +            }
> There's no need to fall back to doing it page-by-page here. You can
> still batch the remainder.
Yes, I agree.
>> +
>> +            goto out;
>> +        }
>> +
>> +        nr -= batch;
>> +        addr += range;
>> +        err_ptr += batch;
>> +    }
>> +
>> +    err = 0;
>> +out:
>> +
>> +    xen_flush_tlb_all();
> Probably not that important anymore since we would now do far fewer TLB
> flushes, but this TLB flush is only needed by the PTEs being updated
> were already present -- if they're all clear then TLB flush can be
> omitted entirely.
The 2.6 kernel actually had a check BUG_ON(!pte_none(pte)) [or something 
like that], but the 3.x kernel seems to have "lost" that. Either way, 
once per, say, 1024 pages isn't measurable in my benchmarks.
>
>> +
>> +    return err;
>> +}
>> +EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_list);
>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>> index 8adb9cc..b39a7b7 100644
>> --- a/drivers/xen/privcmd.c
>> +++ b/drivers/xen/privcmd.c
>> @@ -151,6 +151,41 @@ static int traverse_pages(unsigned nelem, size_t size,
>>       return ret;
>>   }
>>
>> +/*
>> + * Similar to traverse_pages, but use each page as a "block" of
>> + * data to be processed as one unit.
>> + */
>> +static int traverse_pages_block(unsigned nelem, size_t size,
>> +                struct list_head *pos,
>> +                int (*fn)(void *data, int nr, void *state),
>> +                void *state)
>> +{
>> +    void *pagedata;
>> +    unsigned pageidx;
>> +    int ret = 0;
>> +
>> +    BUG_ON(size > PAGE_SIZE);
>> +
>> +    pageidx = PAGE_SIZE;
>> +    pagedata = NULL;    /* hush, gcc */
> What was gcc upset about?  I don't see anything it could get confused about.
This comes from the original traverse_pages code, which does this. I 
will see if it compiles without warnings if I remove the initialization.
>
>> @@ -260,17 +295,17 @@ struct mmap_batch_state {
>>       xen_pfn_t __user *user_mfn;
>>   };
>>
>> -static int mmap_batch_fn(void *data, void *state)
>> +static int mmap_batch_fn(void *data, int nr, void *state)
>>   {
>>       xen_pfn_t *mfnp = data;
>> +
>>       struct mmap_batch_state *st = state;
>>       int ret;
>>
>> -    ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK,
>> *mfnp, 1,
>> -                     st->vma->vm_page_prot, st->domain);
>> +    BUG_ON(nr < 0);
> Is this BUG_ON() useful?
I think very bad things will happen when nr is less than zero, but I 
guess they will be pretty obvious... So perhaps just letting it die from 
the consequences. It does "while(nr)" in the remap function, and then 
min(REMAP_BATCH_SIZE, nr) an various other things that will definitely 
not work as expected if nr is negative. I thought it's much easier to 
detect when it's gone wrong if you assert early tho'.

--
Mats
>
> David
>
>