From mboxrd@z Thu Jan 1 00:00:00 1970 From: Mats Petersson Subject: Re: [RFC/PATCH] Improve speed of mapping guest memory into Dom0 Date: Wed, 14 Nov 2012 14:04:44 +0000 Message-ID: <50A3A4FC.8040408@citrix.com> References: <50A37CC7.8050700@citrix.com> <50A397E1.7000602@citrix.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <50A397E1.7000602@citrix.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: David Vrabel Cc: Konrad Rzeszutek Wilk , "xen-devel@lists.xen.org" List-Id: xen-devel@lists.xenproject.org On 14/11/12 13:08, David Vrabel wrote: > Mats, > > Your patch has been white-space damaged and will not apply. You should > use git send-email which will do the right thing. I've also CC'd Konrad > who is the maintainer for the Xen parts of the kernel. Thanks for the advice. Will use git send-email for V2 of this patch. > On 14/11/12 11:13, Mats Petersson wrote: >> [a long, rambling commit message?] > The text as-is isn't really suitable for a commit message (too much > irrelevant stuff) and there is no suitable subject line. > >> I have also found that the munmap() call used to unmap the guest memory >> from Dom0 is about 35% slower in 3.7 kernel than in the 2.6 kernel (3.8M >> cycles vs 2.8M cycles). > This performance reduction only occurs with 32-bit guests is the Xen > then traps-and-emulates both halves of the PTE write. > >> I think this could be made quicker by using a >> direct write of zero rather than the compare exchange operation that is >> currently used [which traps into Xen, performs the compare & exchange] - > This is something I noticed but never got around to producing a patch. > How about this (uncomplied!) patch? > > -- a/mm/memory.c > +++ b/mm/memory.c > @@ -1146,8 +1146,16 @@ again: > page->index > details->last_index)) > continue; > } > - ptent = ptep_get_and_clear_full(mm, addr, pte, > - tlb->fullmm); > + /* > + * No need for the expensive atomic get and > + * clear for anonymous mappings as the dirty > + * and young bits are not used. > + */ > + if (PageAnon(page)) > + pte_clear(mm, addr, pte); > + else > + ptent = ptep_get_and_clear_full(mm, addr, pte, > + tlb->fullmm); > tlb_remove_tlb_entry(tlb, pte, addr); > if (unlikely(!page)) > continue; I came up with something very similar, however found that page is sometimes NULL which causes problems. if (page && PageAnon(page)) will probably work... Trying that now. > Now for the patch itself. On the whole, the approach looks good and the > real-word performance improvements are nice. Specific comments inline. > >> --- a/arch/x86/xen/mmu.c >> +++ b/arch/x86/xen/mmu.c >> @@ -2542,3 +2561,77 @@ out: >> return err; >> } >> EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_range); >> + >> +/* like xen_remap_domain_mfn_range, but does a list of mfn's, rather >> + * than the, for xen, quite useless, consecutive pages. >> + */ > /* > * Like xen_remap_domain_mfn_range(), but more efficiently handles MFNs > * that are not contiguous (which is common for a domain's memory). > */ Will update. >> +int xen_remap_domain_mfn_list(struct vm_area_struct *vma, >> + unsigned long addr, >> + unsigned long *mfn, int nr, >> + int *err_ptr, >> + pgprot_t prot, unsigned domid) > xen_remap_domain_mfn_array() ? It's not taking a list. Yes, better name. Although, perhaps following Ian Campbell's suggestion and changing the xen_remain_domain_mfn_range to "do the right thing" is even better. I will have a quick look at how difficult that may be - I feel it may be rather simpler than I first thought, as there is only one call to this function in privcmd.c. The arm/xen/enlighten.c will probably also need fixing, I suppose... >> +{ >> + struct remap_list_data rmd; >> + struct mmu_update mmu_update[REMAP_BATCH_SIZE]; > This is surprisingly large (256 bytes) but I note that the existing > xen_remap_domain_mfn_range() does the same thing so I guess it's ok. I would actually like to increase it, but that would probably require allocation. One HV call per 32 >> + int batch; >> + int done; >> + unsigned long range; >> + int err = 0; >> + >> + if (xen_feature(XENFEAT_auto_translated_physmap)) >> + return -EINVAL; >> + >> + prot = __pgprot(pgprot_val(prot) | _PAGE_IOMAP); >> + >> + BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP | >> VM_IO))); >> + >> + rmd.mfn = mfn; >> + rmd.prot = prot; >> + >> + while (nr) { >> + batch = min(REMAP_BATCH_SIZE, nr); >> + range = (unsigned long)batch << PAGE_SHIFT; >> + >> + rmd.mmu_update = mmu_update; >> + err = apply_to_page_range(vma->vm_mm, addr, range, >> + remap_area_mfn_list_pte_fn, &rmd); >> + if (err) >> + { >> + printk("xen_remap_domain_mfn_list: apply_to_range: >> err=%d\n", err); > Stray printk? Yes. >> + goto out; >> + } >> + >> + err = HYPERVISOR_mmu_update(mmu_update, batch, &done, domid); >> + if (err < 0) >> + { >> + int i; >> + /* TODO: We should remove this printk later */ >> + printk("xen_remap_domain_mfn_list: mmu_update: err=%d, >> done=%d, batch=%d\n", err, done, batch); > Yes, you should... > >> + err_ptr[done] = err; >> + >> + /* now do the remaining part of this batch */ >> + for(i = done+1; i < batch; i++) >> + { >> + int tmp_err = HYPERVISOR_mmu_update(&mmu_update[i], 1, >> NULL, domid); >> + if (tmp_err < 0) >> + { >> + err_ptr[i] = tmp_err; >> + } >> + } > There's no need to fall back to doing it page-by-page here. You can > still batch the remainder. Yes, I agree. >> + >> + goto out; >> + } >> + >> + nr -= batch; >> + addr += range; >> + err_ptr += batch; >> + } >> + >> + err = 0; >> +out: >> + >> + xen_flush_tlb_all(); > Probably not that important anymore since we would now do far fewer TLB > flushes, but this TLB flush is only needed by the PTEs being updated > were already present -- if they're all clear then TLB flush can be > omitted entirely. The 2.6 kernel actually had a check BUG_ON(!pte_none(pte)) [or something like that], but the 3.x kernel seems to have "lost" that. Either way, once per, say, 1024 pages isn't measurable in my benchmarks. > >> + >> + return err; >> +} >> +EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_list); >> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c >> index 8adb9cc..b39a7b7 100644 >> --- a/drivers/xen/privcmd.c >> +++ b/drivers/xen/privcmd.c >> @@ -151,6 +151,41 @@ static int traverse_pages(unsigned nelem, size_t size, >> return ret; >> } >> >> +/* >> + * Similar to traverse_pages, but use each page as a "block" of >> + * data to be processed as one unit. >> + */ >> +static int traverse_pages_block(unsigned nelem, size_t size, >> + struct list_head *pos, >> + int (*fn)(void *data, int nr, void *state), >> + void *state) >> +{ >> + void *pagedata; >> + unsigned pageidx; >> + int ret = 0; >> + >> + BUG_ON(size > PAGE_SIZE); >> + >> + pageidx = PAGE_SIZE; >> + pagedata = NULL; /* hush, gcc */ > What was gcc upset about? I don't see anything it could get confused about. This comes from the original traverse_pages code, which does this. I will see if it compiles without warnings if I remove the initialization. > >> @@ -260,17 +295,17 @@ struct mmap_batch_state { >> xen_pfn_t __user *user_mfn; >> }; >> >> -static int mmap_batch_fn(void *data, void *state) >> +static int mmap_batch_fn(void *data, int nr, void *state) >> { >> xen_pfn_t *mfnp = data; >> + >> struct mmap_batch_state *st = state; >> int ret; >> >> - ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, >> *mfnp, 1, >> - st->vma->vm_page_prot, st->domain); >> + BUG_ON(nr < 0); > Is this BUG_ON() useful? I think very bad things will happen when nr is less than zero, but I guess they will be pretty obvious... So perhaps just letting it die from the consequences. It does "while(nr)" in the remap function, and then min(REMAP_BATCH_SIZE, nr) an various other things that will definitely not work as expected if nr is negative. I thought it's much easier to detect when it's gone wrong if you assert early tho'. -- Mats > > David > >