From: Mats Petersson <mats.petersson@citrix.com>
To: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: [RFC/PATCH] Improve speed of mapping guest memory into Dom0
Date: Wed, 14 Nov 2012 14:04:44 +0000 [thread overview]
Message-ID: <50A3A4FC.8040408@citrix.com> (raw)
In-Reply-To: <50A397E1.7000602@citrix.com>
On 14/11/12 13:08, David Vrabel wrote:
> Mats,
>
> Your patch has been white-space damaged and will not apply. You should
> use git send-email which will do the right thing. I've also CC'd Konrad
> who is the maintainer for the Xen parts of the kernel.
Thanks for the advice. Will use git send-email for V2 of this patch.
> On 14/11/12 11:13, Mats Petersson wrote:
>> [a long, rambling commit message?]
> The text as-is isn't really suitable for a commit message (too much
> irrelevant stuff) and there is no suitable subject line.
>
>> I have also found that the munmap() call used to unmap the guest memory
>> from Dom0 is about 35% slower in 3.7 kernel than in the 2.6 kernel (3.8M
>> cycles vs 2.8M cycles).
> This performance reduction only occurs with 32-bit guests is the Xen
> then traps-and-emulates both halves of the PTE write.
>
>> I think this could be made quicker by using a
>> direct write of zero rather than the compare exchange operation that is
>> currently used [which traps into Xen, performs the compare & exchange] -
> This is something I noticed but never got around to producing a patch.
> How about this (uncomplied!) patch?
>
> -- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1146,8 +1146,16 @@ again:
> page->index > details->last_index))
> continue;
> }
> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> - tlb->fullmm);
> + /*
> + * No need for the expensive atomic get and
> + * clear for anonymous mappings as the dirty
> + * and young bits are not used.
> + */
> + if (PageAnon(page))
> + pte_clear(mm, addr, pte);
> + else
> + ptent = ptep_get_and_clear_full(mm, addr, pte,
> + tlb->fullmm);
> tlb_remove_tlb_entry(tlb, pte, addr);
> if (unlikely(!page))
> continue;
I came up with something very similar, however found that page is
sometimes NULL which causes problems. if (page && PageAnon(page)) will
probably work... Trying that now.
> Now for the patch itself. On the whole, the approach looks good and the
> real-word performance improvements are nice. Specific comments inline.
>
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -2542,3 +2561,77 @@ out:
>> return err;
>> }
>> EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_range);
>> +
>> +/* like xen_remap_domain_mfn_range, but does a list of mfn's, rather
>> + * than the, for xen, quite useless, consecutive pages.
>> + */
> /*
> * Like xen_remap_domain_mfn_range(), but more efficiently handles MFNs
> * that are not contiguous (which is common for a domain's memory).
> */
Will update.
>> +int xen_remap_domain_mfn_list(struct vm_area_struct *vma,
>> + unsigned long addr,
>> + unsigned long *mfn, int nr,
>> + int *err_ptr,
>> + pgprot_t prot, unsigned domid)
> xen_remap_domain_mfn_array() ? It's not taking a list.
Yes, better name. Although, perhaps following Ian Campbell's suggestion
and changing the xen_remain_domain_mfn_range to "do the right thing" is
even better. I will have a quick look at how difficult that may be - I
feel it may be rather simpler than I first thought, as there is only one
call to this function in privcmd.c.
The arm/xen/enlighten.c will probably also need fixing, I suppose...
>> +{
>> + struct remap_list_data rmd;
>> + struct mmu_update mmu_update[REMAP_BATCH_SIZE];
> This is surprisingly large (256 bytes) but I note that the existing
> xen_remap_domain_mfn_range() does the same thing so I guess it's ok.
I would actually like to increase it, but that would probably require
allocation. One HV call per 32
>> + int batch;
>> + int done;
>> + unsigned long range;
>> + int err = 0;
>> +
>> + if (xen_feature(XENFEAT_auto_translated_physmap))
>> + return -EINVAL;
>> +
>> + prot = __pgprot(pgprot_val(prot) | _PAGE_IOMAP);
>> +
>> + BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP |
>> VM_IO)));
>> +
>> + rmd.mfn = mfn;
>> + rmd.prot = prot;
>> +
>> + while (nr) {
>> + batch = min(REMAP_BATCH_SIZE, nr);
>> + range = (unsigned long)batch << PAGE_SHIFT;
>> +
>> + rmd.mmu_update = mmu_update;
>> + err = apply_to_page_range(vma->vm_mm, addr, range,
>> + remap_area_mfn_list_pte_fn, &rmd);
>> + if (err)
>> + {
>> + printk("xen_remap_domain_mfn_list: apply_to_range:
>> err=%d\n", err);
> Stray printk?
Yes.
>> + goto out;
>> + }
>> +
>> + err = HYPERVISOR_mmu_update(mmu_update, batch, &done, domid);
>> + if (err < 0)
>> + {
>> + int i;
>> + /* TODO: We should remove this printk later */
>> + printk("xen_remap_domain_mfn_list: mmu_update: err=%d,
>> done=%d, batch=%d\n", err, done, batch);
> Yes, you should...
>
>> + err_ptr[done] = err;
>> +
>> + /* now do the remaining part of this batch */
>> + for(i = done+1; i < batch; i++)
>> + {
>> + int tmp_err = HYPERVISOR_mmu_update(&mmu_update[i], 1,
>> NULL, domid);
>> + if (tmp_err < 0)
>> + {
>> + err_ptr[i] = tmp_err;
>> + }
>> + }
> There's no need to fall back to doing it page-by-page here. You can
> still batch the remainder.
Yes, I agree.
>> +
>> + goto out;
>> + }
>> +
>> + nr -= batch;
>> + addr += range;
>> + err_ptr += batch;
>> + }
>> +
>> + err = 0;
>> +out:
>> +
>> + xen_flush_tlb_all();
> Probably not that important anymore since we would now do far fewer TLB
> flushes, but this TLB flush is only needed by the PTEs being updated
> were already present -- if they're all clear then TLB flush can be
> omitted entirely.
The 2.6 kernel actually had a check BUG_ON(!pte_none(pte)) [or something
like that], but the 3.x kernel seems to have "lost" that. Either way,
once per, say, 1024 pages isn't measurable in my benchmarks.
>
>> +
>> + return err;
>> +}
>> +EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_list);
>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>> index 8adb9cc..b39a7b7 100644
>> --- a/drivers/xen/privcmd.c
>> +++ b/drivers/xen/privcmd.c
>> @@ -151,6 +151,41 @@ static int traverse_pages(unsigned nelem, size_t size,
>> return ret;
>> }
>>
>> +/*
>> + * Similar to traverse_pages, but use each page as a "block" of
>> + * data to be processed as one unit.
>> + */
>> +static int traverse_pages_block(unsigned nelem, size_t size,
>> + struct list_head *pos,
>> + int (*fn)(void *data, int nr, void *state),
>> + void *state)
>> +{
>> + void *pagedata;
>> + unsigned pageidx;
>> + int ret = 0;
>> +
>> + BUG_ON(size > PAGE_SIZE);
>> +
>> + pageidx = PAGE_SIZE;
>> + pagedata = NULL; /* hush, gcc */
> What was gcc upset about? I don't see anything it could get confused about.
This comes from the original traverse_pages code, which does this. I
will see if it compiles without warnings if I remove the initialization.
>
>> @@ -260,17 +295,17 @@ struct mmap_batch_state {
>> xen_pfn_t __user *user_mfn;
>> };
>>
>> -static int mmap_batch_fn(void *data, void *state)
>> +static int mmap_batch_fn(void *data, int nr, void *state)
>> {
>> xen_pfn_t *mfnp = data;
>> +
>> struct mmap_batch_state *st = state;
>> int ret;
>>
>> - ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK,
>> *mfnp, 1,
>> - st->vma->vm_page_prot, st->domain);
>> + BUG_ON(nr < 0);
> Is this BUG_ON() useful?
I think very bad things will happen when nr is less than zero, but I
guess they will be pretty obvious... So perhaps just letting it die from
the consequences. It does "while(nr)" in the remap function, and then
min(REMAP_BATCH_SIZE, nr) an various other things that will definitely
not work as expected if nr is negative. I thought it's much easier to
detect when it's gone wrong if you assert early tho'.
--
Mats
>
> David
>
>
next prev parent reply other threads:[~2012-11-14 14:04 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-11-14 11:13 [RFC/PATCH] Improve speed of mapping guest memory into Dom0 Mats Petersson
2012-11-14 13:08 ` David Vrabel
2012-11-14 13:16 ` Ian Campbell
2012-11-14 14:04 ` Mats Petersson [this message]
2012-11-14 16:39 ` David Vrabel
2012-11-14 16:43 ` Mats Petersson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=50A3A4FC.8040408@citrix.com \
--to=mats.petersson@citrix.com \
--cc=david.vrabel@citrix.com \
--cc=konrad.wilk@oracle.com \
--cc=xen-devel@lists.xen.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.