* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
@ 2013-10-20 4:22 Jaeyong Yoo
0 siblings, 0 replies; 18+ messages in thread
From: Jaeyong Yoo @ 2013-10-20 4:22 UTC (permalink / raw)
To: Ian Campbell; +Cc: 'Stefano Stabellini', xen-devel@lists.xen.org
> > > I meant the pte frobbing immediately before the call the add_mapped_vaddr
> > > quoted above.
> > >
> >
> > I don't think it is necessary, because at the moment the dirty bitmap
> > is constructed from the VAs in the linked list. If the PTE frobbing
> > happens immediately before add_mapped_vaddr, the corresponding dirty page
> > would be picked up at the next round of get-dirty-bitmap.
>
> OK, thanks. Can we get a code comment on the ordering constraints here,
> please?
Sure, of course.
Jaeyong.
^ permalink raw reply [flat|nested] 18+ messages in thread
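
For reference, a minimal sketch of the kind of ordering comment requested above, written against the handle_page_fault() shape posted later in this thread; the comment wording is illustrative rather than taken from the series.

/* Illustrative sketch only -- based on handle_page_fault() as posted
 * elsewhere in this thread; the ordering comment is the point at issue. */
void handle_page_fault(struct domain *d, paddr_t addr)
{
    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);

    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
    {
        lpae_t pte = *vlp2m_pte;

        pte.p2m.write = 1;
        write_pte(vlp2m_pte, pte);
        flush_tlb_local();

        /*
         * Ordering: the PTE is made writable before the address is
         * recorded with add_mapped_vaddr().  This is safe because the
         * dirty bitmap is built only from the recorded VAs: a write that
         * slips in between simply means the page is reported (and made
         * read-only again) at the next CLEAN/PEEK round.
         */
        add_mapped_vaddr(d, addr);
    }
}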
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing @ 2013-10-08 6:29 Jaeyong Yoo 2013-10-08 8:46 ` Ian Campbell 0 siblings, 1 reply; 18+ messages in thread From: Jaeyong Yoo @ 2013-10-08 6:29 UTC (permalink / raw) To: Julien Grall; +Cc: xen-devel@lists.xen.org >------- Original Message ------- >Sender : Julien Grall<julien.grall@citrix.com> >Date : 2013-10-07 22:02 (GMT+09:00) >Title : Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing > >On 10/04/2013 05:44 AM, Jaeyong Yoo wrote: >> Add hypercall (shadow op: enable/disable and clean/peek dirted page bitmap). >> It consists of two parts: dirty page detecting and saving. >> For detecting, we setup the guest p2m's leaf PTE read-only and whenever >> the guest tries to write something, permission fault happens and traps into xen. >> The permission-faulted GPA should be saved for the toolstack (when it wants to see >> which pages are dirted). For this purpose, we temporarily save the GPAs into linked >> list by using 'add_mapped_vaddr' function and when toolstack wants >> (log_dirty_op function) the GPAs are copied into bitmap and the linnked list is flushed. >> >> Additionally, for supporting parallel migration of domUs, vlpt area should be context >> switched. >> >> Signed-off-by: Jaeyong Yoo >> --- > >[..] > >> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c >> index 4c0fc32..3b78ed2 100644 >> --- a/xen/arch/arm/traps.c >> +++ b/xen/arch/arm/traps.c >> @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, >> const char *msg; >> int rc, level = -1; >> mmio_info_t info; >> + int page_fault = ((dabt.dfsc & FSC_MASK) == >> + (FSC_FLT_PERM + FSC_3D_LEVEL) && dabt.write); >> >> if ( !check_conditional_instr(regs, hsr) ) >> { >> @@ -1327,22 +1329,23 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, >> info.gva = READ_SYSREG64(FAR_EL2); >> #endif >> >> - if (dabt.s1ptw) >> + if ( dabt.s1ptw && !page_fault ) > >I think checking !page_fault is nearly everywhere is error-prone when >this function will be modified. > >Can you do something like this? > >if ( page_fault ) > // Your code to handle page fault >else >{ > // handle_mmio >} > >It will avoid && !page_fault. That looks better. >> goto bad_data_abort; >> >> rc = gva_to_ipa(info.gva, &info.gpa); >> - if ( rc == -EFAULT ) >> + if ( rc == -EFAULT && !page_fault ) >> goto bad_data_abort; >> >> /* XXX: Decode the instruction if ISS is not valid */ >> - if ( !dabt.valid ) >> + if ( !dabt.valid && !page_fault ) >> goto bad_data_abort; >> >> /* >> * Erratum 766422: Thumb store translation fault to Hypervisor may >> * not have correct HSR Rt value. >> */ >> - if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write ) >> + if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write >> + && !page_fault) >> { >> rc = decode_instruction(regs, &info.dabt); >> if ( rc ) >> @@ -1358,6 +1361,16 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, >> return; >> } >> >> + /* handle permission fault on write */ >> + if ( page_fault ) >> + { >> + if ( current->domain->arch.dirty.mode == 0 ) >> + goto bad_data_abort; >> + >> + handle_page_fault(current->domain, info.gpa); > >You must call advance_pc(regs, hsr) here. I got it. > >> + return; >> + } >> + >> bad_data_abort: > >-- >Julien Grall ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-08 6:29 Jaeyong Yoo
@ 2013-10-08 8:46 ` Ian Campbell
  2013-10-08 9:46   ` Jaeyong Yoo
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Campbell @ 2013-10-08 8:46 UTC (permalink / raw)
To: jaeyong.yoo; +Cc: Julien Grall, xen-devel@lists.xen.org

On Tue, 2013-10-08 at 06:29 +0000, Jaeyong Yoo wrote:
> >> +    /* handle permission fault on write */
> >> +    if ( page_fault )
> >> +    {
> >> +        if ( current->domain->arch.dirty.mode == 0 )
> >> +            goto bad_data_abort;
> >> +
> >> +        handle_page_fault(current->domain, info.gpa);
> >
> >You must call advance_pc(regs, hsr) here.
>
> I got it.

Are you sure about this? Eugene's reasoning that we need to replay the
faulting instruction seemed pretty sound.

Ian.

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-08 8:46 ` Ian Campbell
@ 2013-10-08 9:46   ` Jaeyong Yoo
  2013-10-08 15:36     ` Eugene Fedotov
  0 siblings, 1 reply; 18+ messages in thread
From: Jaeyong Yoo @ 2013-10-08 9:46 UTC (permalink / raw)
To: 'Ian Campbell'; +Cc: 'Julien Grall', xen-devel

> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> bounces@lists.xen.org] On Behalf Of Ian Campbell
> Sent: Tuesday, October 08, 2013 5:46 PM
> To: jaeyong.yoo@samsung.com
> Cc: Julien Grall; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for
> dirty page tracing
>
> On Tue, 2013-10-08 at 06:29 +0000, Jaeyong Yoo wrote:
> > >> +    /* handle permission fault on write */
> > >> +    if ( page_fault )
> > >> +    {
> > >> +        if ( current->domain->arch.dirty.mode == 0 )
> > >> +            goto bad_data_abort;
> > >> +
> > >> +        handle_page_fault(current->domain, info.gpa);
> > >
> > >You must call advance_pc(regs, hsr) here.
> >
> > I got it.
>
> Are you sure about this? Eugene's reasoning that we need to replay the
> faulting instruction seemed pretty sound.

Oh, I got confused. I think Eugene's reasoning is correct: handle_mmio
emulates the access, while handle_page_fault handles a generic fault that
should be resolved so that the faulting instruction can be repeated.
Sorry for the confusion.

Jaeyong

> Ian.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-08 9:46 ` Jaeyong Yoo
@ 2013-10-08 15:36   ` Eugene Fedotov
  2013-10-10 14:13     ` Julien Grall
  0 siblings, 1 reply; 18+ messages in thread
From: Eugene Fedotov @ 2013-10-08 15:36 UTC (permalink / raw)
To: xen-devel

If we move the necessary checks into the handle_page_fault routine, the
resulting patch for traps.c will look simpler:

--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
     const char *msg;
     int rc, level = -1;
     mmio_info_t info;
+    int page_fault = ( (dabt.dfsc & FSC_MASK) ==
+                       (FSC_FLT_PERM | FSC_3D_LEVEL) && dabt.write );

     if ( !check_conditional_instr(regs, hsr) )
     {
@@ -1334,6 +1336,13 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
     if ( rc == -EFAULT )
         goto bad_data_abort;

+    /* domU page fault handling for guest live migration */
+    /* dabt.valid can be 0 here */
+    if ( page_fault && handle_page_fault(current->domain, info.gpa) )
+    {
+        /* Do not modify pc after page fault to repeat memory operation */
+        return;
+    }
     /* XXX: Decode the instruction if ISS is not valid */
     if ( !dabt.valid )
         goto bad_data_abort;

Will it be acceptable, or do you think the "else" statement looks better?

Here handle_page_fault returns zero if dirty-page tracing is not enabled
for the domain (dirty.mode == 0) or the address is not valid for the domain
(we check it against the memory map), so MMIO accesses and faults from dom0
are not handled by handle_page_fault. The handle_page_fault routine (see
[PATCH v4 7/9] xen/arm: Implement virtual-linear page table for guest p2m
mapping in live migration) will be:

int handle_page_fault(struct domain *d, paddr_t addr)
{
    lpae_t *vlp2m_pte = 0;
    vaddr_t start, end;

    if (!d->arch.dirty.mode)
        return 0;

    /* Ensure that addr is inside guest's RAM */
    get_gma_start_end(d, &start, &end);
    if ( addr < start || addr > end )
        return 0;

    vlp2m_pte = get_vlpt_3lvl_pte(addr);
    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
    {
        lpae_t pte = *vlp2m_pte;
        pte.p2m.write = 1;
        write_pte(vlp2m_pte, pte);
        flush_tlb_local();

        /* in order to remove mappings in cleanup stage */
        add_mapped_vaddr(d, addr);
    }
    return 1;
}

Best,
Evgeny.

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-08 15:36 ` Eugene Fedotov
@ 2013-10-10 14:13   ` Julien Grall
  2013-10-16 13:44     ` Ian Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: Julien Grall @ 2013-10-10 14:13 UTC (permalink / raw)
To: Eugene Fedotov; +Cc: xen-devel

On 10/08/2013 04:36 PM, Eugene Fedotov wrote:
> If we move the necessary checks into the handle_page_fault routine, the
> resulting patch for traps.c will look simpler:
>
> --- a/xen/arch/arm/traps.c
> +++ b/xen/arch/arm/traps.c
> @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct
> cpu_user_regs *regs,
>      const char *msg;
>      int rc, level = -1;
>      mmio_info_t info;
> +    int page_fault = ( (dabt.dfsc & FSC_MASK) ==
> +                       (FSC_FLT_PERM | FSC_3D_LEVEL) && dabt.write );
>
>      if ( !check_conditional_instr(regs, hsr) )
>      {
> @@ -1334,6 +1336,13 @@ static void do_trap_data_abort_guest(struct
> cpu_user_regs *regs,
>      if ( rc == -EFAULT )
>          goto bad_data_abort;
>
> +    /* domU page fault handling for guest live migration */
> +    /* dabt.valid can be 0 here */
> +    if ( page_fault && handle_page_fault(current->domain, info.gpa) )
> +    {
> +        /* Do not modify pc after page fault to repeat memory operation */
> +        return;
> +    }
>      /* XXX: Decode the instruction if ISS is not valid */
>      if ( !dabt.valid )
>          goto bad_data_abort;
>
> Will it be acceptable, or do you think the "else" statement looks better?

I'm fine with this solution.

Cheers,

-- 
Julien Grall

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-10 14:13 ` Julien Grall
@ 2013-10-16 13:44   ` Ian Campbell
  0 siblings, 0 replies; 18+ messages in thread
From: Ian Campbell @ 2013-10-16 13:44 UTC (permalink / raw)
To: Julien Grall; +Cc: Eugene Fedotov, xen-devel

On Thu, 2013-10-10 at 15:13 +0100, Julien Grall wrote:
> On 10/08/2013 04:36 PM, Eugene Fedotov wrote:
> > If we move the necessary checks into the handle_page_fault routine, the
> > resulting patch for traps.c will look simpler:
> >
> > --- a/xen/arch/arm/traps.c
> > +++ b/xen/arch/arm/traps.c
> > @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct
> > cpu_user_regs *regs,
> >      const char *msg;
> >      int rc, level = -1;
> >      mmio_info_t info;
> > +    int page_fault = ( (dabt.dfsc & FSC_MASK) ==
> > +                       (FSC_FLT_PERM | FSC_3D_LEVEL) && dabt.write );
> >
> >      if ( !check_conditional_instr(regs, hsr) )
> >      {
> > @@ -1334,6 +1336,13 @@ static void do_trap_data_abort_guest(struct
> > cpu_user_regs *regs,
> >      if ( rc == -EFAULT )
> >          goto bad_data_abort;
> >
> > +    /* domU page fault handling for guest live migration */
> > +    /* dabt.valid can be 0 here */
> > +    if ( page_fault && handle_page_fault(current->domain, info.gpa) )
> > +    {
> > +        /* Do not modify pc after page fault to repeat memory operation */
> > +        return;
> > +    }
> >      /* XXX: Decode the instruction if ISS is not valid */
> >      if ( !dabt.valid )
> >          goto bad_data_abort;
> >
> > Will it be acceptable, or do you think the "else" statement looks better?
>
> I'm fine with this solution.

Looks good to me too, assuming you are happy with the ordering of MMIO
handling vs page faults.

> Cheers,
>

^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 0/9] xen/arm: live migration support in arndale board @ 2013-10-04 4:43 Jaeyong Yoo 2013-10-04 4:44 ` [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing Jaeyong Yoo 0 siblings, 1 reply; 18+ messages in thread From: Jaeyong Yoo @ 2013-10-04 4:43 UTC (permalink / raw) To: xen-devel; +Cc: Jaeyong Yoo Hello xen-devel, here goes the v4 patch series. The major changes in this version is the following: 1) VLPT is improved. In version 3, we manually construct the xen's page table to establish VLPT, but in this version, by simply copying the guest p2m's entry into xen's page table, we can accomplish the same with smaller LOC and faster dirty-page detection. 2) Enable SMP-domu live migration. 3) Stability improved. For single-core domU live migration, we can migrate 512 times in a row (512 is the maximum number of domain ID, as far as I know). For SMP-core domU live migration, we can migrate 50 ~ 300 times in a row, (minimum 50), and after that dom0 or domU crashes. We are trying to figure out the reason of faults. I hope I could get some feedbacks from community members about this. And, here goes the brief description of each patch. Patch 1 implements hvm save/restore. Patch 2 implements vcpu save/restore by adding more required registers for live migration. Patch 3 implements 'set_memory_map' hyerpcall for telling hypervisor about the DomU's memory map. This memory map is used for dirty-page tracing (Patch 4, 7, 8, and 9) Patch 4 implements 'get_maximum_gpfn' Patch 5 implements 'modify_returncode' for switching the return value of suspend hypercall from domU. Patch 6 is an obvious fix of a clear_guest_offset macro. Patch 7 implements base functions for VLPT. Patch 8 implements dirty-page tracing by using VLPT. Patch 9 implements the toolstack part for live migration of ARM. 
Best, Jaeyong Alexey Sokolov, Elena Pyatunina, Evgeny Fedotov, and Nikolay Martyanov (1): xen/arm: Implement toolstack for xl restore/save and migrate Alexey Sokolov (1): xen/arm: Implement modify_returncode Evgeny Fedotov (2): xen/arm: Implement set_memory_map hypercall xen/arm: Implement get_maximum_gpfn hypercall for arm Jaeyong Yoo and Evgeny Fedotov (1): xen/arm: Implement hvm save and restore Jaeyong Yoo and Alexey Sokolov (1): xen/arm: Add more registers for saving and restoring vcpu registers Jaeyong Yoo and Elena Pyatunina (1) xen/arm: Implement hypercall for dirty page tracing Jaeyong Yoo (2): xen/arm: Implement virtual-linear page table for guest p2m mapping in live migration xen/arm: Fixing clear_guest_offset macro config/arm32.mk | 1 + tools/include/xen-foreign/reference.size | 2 +- tools/libxc/Makefile | 5 + tools/libxc/xc_arm_migrate.c | 805 +++++++++++++++++++++++++++++++ tools/libxc/xc_dom_arm.c | 12 +- tools/libxc/xc_domain.c | 44 ++ tools/libxc/xc_resume.c | 25 + tools/libxc/xenctrl.h | 23 + tools/misc/Makefile | 4 + xen/arch/arm/Makefile | 1 + xen/arch/arm/domain.c | 52 ++ xen/arch/arm/domctl.c | 136 +++++- xen/arch/arm/hvm.c | 228 +++++++++ xen/arch/arm/mm.c | 190 +++++++- xen/arch/arm/p2m.c | 300 ++++++++++++ xen/arch/arm/save.c | 66 +++ xen/arch/arm/traps.c | 21 +- xen/common/Makefile | 2 + xen/include/asm-arm/config.h | 6 + xen/include/asm-arm/domain.h | 14 + xen/include/asm-arm/guest_access.h | 5 +- xen/include/asm-arm/hvm/support.h | 29 ++ xen/include/asm-arm/mm.h | 29 ++ xen/include/asm-arm/p2m.h | 4 + xen/include/asm-arm/processor.h | 2 + xen/include/public/arch-arm.h | 35 ++ xen/include/public/arch-arm/hvm/save.h | 66 +++ xen/include/public/memory.h | 15 +- xen/include/xsm/dummy.h | 5 + xen/include/xsm/xsm.h | 5 + 30 files changed, 2119 insertions(+), 13 deletions(-) create mode 100644 tools/libxc/xc_arm_migrate.c create mode 100644 xen/arch/arm/save.c create mode 100644 xen/include/asm-arm/hvm/support.h -- 1.8.1.2 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing 2013-10-04 4:43 [PATCH v4 0/9] xen/arm: live migration support in arndale board Jaeyong Yoo @ 2013-10-04 4:44 ` Jaeyong Yoo 2013-10-07 13:02 ` Julien Grall 2013-10-16 13:44 ` Ian Campbell 0 siblings, 2 replies; 18+ messages in thread From: Jaeyong Yoo @ 2013-10-04 4:44 UTC (permalink / raw) To: xen-devel; +Cc: Jaeyong Yoo Add hypercall (shadow op: enable/disable and clean/peek dirted page bitmap). It consists of two parts: dirty page detecting and saving. For detecting, we setup the guest p2m's leaf PTE read-only and whenever the guest tries to write something, permission fault happens and traps into xen. The permission-faulted GPA should be saved for the toolstack (when it wants to see which pages are dirted). For this purpose, we temporarily save the GPAs into linked list by using 'add_mapped_vaddr' function and when toolstack wants (log_dirty_op function) the GPAs are copied into bitmap and the linnked list is flushed. Additionally, for supporting parallel migration of domUs, vlpt area should be context switched. Signed-off-by: Jaeyong Yoo <jaeyong.yoo@samsung.com> --- xen/arch/arm/domain.c | 15 ++ xen/arch/arm/domctl.c | 13 ++ xen/arch/arm/mm.c | 23 ++- xen/arch/arm/p2m.c | 300 ++++++++++++++++++++++++++++++++++++++++ xen/arch/arm/traps.c | 21 ++- xen/include/asm-arm/domain.h | 5 + xen/include/asm-arm/mm.h | 4 + xen/include/asm-arm/p2m.h | 4 + xen/include/asm-arm/processor.h | 2 + 9 files changed, 382 insertions(+), 5 deletions(-) diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c index 57be341..40ebeed 100644 --- a/xen/arch/arm/domain.c +++ b/xen/arch/arm/domain.c @@ -215,6 +215,10 @@ static void ctxt_switch_to(struct vcpu *n) WRITE_SYSREG(hcr, HCR_EL2); isb(); + /* vlpt area context switching for dirty-page tracing */ + if ( n->domain->arch.dirty.mode ) + restore_vlpt(n->domain); + /* This is could trigger an hardware interrupt from the virtual * timer. The interrupt needs to be injected into the guest. 
*/ virt_timer_restore(n); @@ -512,6 +516,17 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags) spin_lock_init(&d->arch.map_lock); d->arch.map_domain.nr_banks = 0; + /* init for dirty-page tracing */ + d->arch.dirty.count = 0; + d->arch.dirty.mode = 0; + d->arch.dirty.head = NULL; + d->arch.dirty.mvn_head = NULL; + spin_lock_init(&d->arch.dirty.lock); + + d->arch.dirty.second_lvl_start = 0; + d->arch.dirty.second_lvl_end = 0; + d->arch.dirty.second_lvl = NULL; + clear_page(d->shared_info); share_xen_page_with_guest( virt_to_page(d->shared_info), d, XENSHARE_writable); diff --git a/xen/arch/arm/domctl.c b/xen/arch/arm/domctl.c index 78fb322..7edce12 100644 --- a/xen/arch/arm/domctl.c +++ b/xen/arch/arm/domctl.c @@ -92,6 +92,19 @@ long arch_do_domctl(struct xen_domctl *domctl, struct domain *d, xfree(c.data); } break; + case XEN_DOMCTL_shadow_op: + { + domain_pause(d); + ret = dirty_mode_op(d, &domctl->u.shadow_op); + domain_unpause(d); + + if ( (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_CLEAN || + (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_PEEK ) + { + copyback = 1; + } + } + break; default: return -EINVAL; diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c index 98edbc9..5b9d26d 100644 --- a/xen/arch/arm/mm.c +++ b/xen/arch/arm/mm.c @@ -838,7 +838,6 @@ void destroy_xen_mappings(unsigned long v, unsigned long e) create_xen_entries(REMOVE, v, 0, (e - v) >> PAGE_SHIFT, 0); } -enum mg { mg_clear, mg_ro, mg_rw, mg_rx }; static void set_pte_flags_on_range(const char *p, unsigned long l, enum mg mg) { lpae_t pte; @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d) unmap_domain_page_global(d->arch.dirty.second_lvl); } +/* routine for dirty-page tracing + * + * On first write, it page faults, its entry is changed to read-write, + * and on retry the write succeeds. + * + * for locating p2m of the faulting entry, we use virtual-linear page table. 
+ */ +void handle_page_fault(struct domain *d, paddr_t addr) +{ + lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr); + if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 ) + { + lpae_t pte = *vlp2m_pte; + pte.p2m.write = 1; + write_pte(vlp2m_pte, pte); + flush_tlb_local(); + + /* in order to remove mappings in cleanup stage */ + add_mapped_vaddr(d, addr); + } +} + /* * Local variables: * mode: C diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c index 2d09fef..3b2b11a 100644 --- a/xen/arch/arm/p2m.c +++ b/xen/arch/arm/p2m.c @@ -6,6 +6,8 @@ #include <xen/bitops.h> #include <asm/flushtlb.h> #include <asm/gic.h> +#include <xen/guest_access.h> +#include <xen/pfn.h> void dump_p2m_lookup(struct domain *d, paddr_t addr) { @@ -408,6 +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn) return p >> PAGE_SHIFT; } +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - sizeof(int))\ + / sizeof(unsigned long) + +/* An array-based linked list for storing virtual addresses (i.e., the 3rd + * level table mapping) that should be destroyed after live migration */ +struct mapped_va_node +{ + struct page_info *next; + struct mapped_va_node *mvn_next; + int items; + unsigned long vaddrs[MAX_VA_PER_NODE]; +}; + +int add_mapped_vaddr(struct domain *d, unsigned long va) +{ + struct page_info *page_head; + struct mapped_va_node *mvn_head; + + spin_lock(&d->arch.dirty.lock); + + page_head = d->arch.dirty.head; + mvn_head = d->arch.dirty.mvn_head; + if ( !mvn_head ) + { + page_head = alloc_domheap_page(NULL, 0); + if ( !page_head ) + { + spin_unlock(&d->arch.dirty.lock); + return -ENOMEM; + } + + mvn_head = map_domain_page_global(__page_to_mfn(page_head)); + mvn_head->items = 0; + mvn_head->next = NULL; + mvn_head->mvn_next = NULL; + d->arch.dirty.head = page_head; + d->arch.dirty.mvn_head = mvn_head; + } + + if ( mvn_head->items == MAX_VA_PER_NODE ) + { + struct page_info *page; + + page = alloc_domheap_page(NULL, 0); + if ( !page ) + { + spin_unlock(&d->arch.dirty.lock); + return -ENOMEM; + } + + mvn_head = map_domain_page_global(__page_to_mfn(page)); + mvn_head->items = 0; + mvn_head->next = d->arch.dirty.head; + mvn_head->mvn_next = d->arch.dirty.mvn_head; + + d->arch.dirty.mvn_head = mvn_head; + d->arch.dirty.head = page; + } + + mvn_head->vaddrs[mvn_head->items] = va; + mvn_head->items ++; + + spin_unlock(&d->arch.dirty.lock); + return 0; +} + +/* get the dirty bitmap from the linked list stored by add_mapped_vaddr + * and also clear the mapped vaddrs linked list */ +static void get_dirty_bitmap(struct domain *d, uint8_t *bitmap[]) +{ + struct page_info *page_head; + struct mapped_va_node *mvn_head; + struct page_info *tmp_page; + struct mapped_va_node *tmp_mvn; + vaddr_t gma_start = 0; + vaddr_t gma_end = 0; + + /* this hypercall is called from domain 0, and we don't know which guest's + * vlpt is mapped in xen_second, so, to be sure, we restore vlpt here */ + restore_vlpt(d); + + spin_lock(&d->arch.dirty.lock); + + page_head = d->arch.dirty.head; + mvn_head = d->arch.dirty.mvn_head; + get_gma_start_end(d, &gma_start, &gma_end); + + while ( page_head ) + { + int i; + + for ( i = 0; i < mvn_head->items; ++i ) + { + unsigned int bit_offset; + int bitmap_index, bitmap_offset; + lpae_t *vlp2m_pte; + + bit_offset = third_linear_offset(mvn_head->vaddrs[i] - gma_start); + bitmap_index = bit_offset >> (PAGE_SHIFT + 3); + bitmap_offset = bit_offset & ((1ul << (PAGE_SHIFT + 3)) - 1); + + __set_bit(bitmap_offset, bitmap[bitmap_index]); + + vlp2m_pte = get_vlpt_3lvl_pte(mvn_head->vaddrs[i]); + 
vlp2m_pte->p2m.write = 0; + } + + tmp_page = page_head; + page_head = mvn_head->next; + + tmp_mvn = mvn_head; + mvn_head = mvn_head->mvn_next; + + unmap_domain_page_global(tmp_mvn); + free_domheap_page(tmp_page); + } + + d->arch.dirty.head = NULL; + d->arch.dirty.mvn_head = NULL; + + spin_unlock(&d->arch.dirty.lock); +} + + +/* Change types across all p2m entries in a domain */ +static void p2m_change_entry_type_global(struct domain *d, enum mg nt) +{ + struct p2m_domain *p2m = &d->arch.p2m; + vaddr_t ram_base; + int i1, i2, i3; + int first_index, second_index, third_index; + lpae_t *first = __map_domain_page(p2m->first_level); + lpae_t pte, *second = NULL, *third = NULL; + + get_gma_start_end(d, &ram_base, NULL); + + first_index = first_table_offset((uint64_t)ram_base); + second_index = second_table_offset((uint64_t)ram_base); + third_index = third_table_offset((uint64_t)ram_base); + + BUG_ON( !first && "Can't map first level p2m." ); + + spin_lock(&p2m->lock); + + for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 ) + { + lpae_walk_t first_pte = first[i1].walk; + if ( !first_pte.valid || !first_pte.table ) + goto out; + + second = map_domain_page(first_pte.base); + BUG_ON( !second && "Can't map second level p2m."); + for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 ) + { + lpae_walk_t second_pte = second[i2].walk; + + /* a nasty hack for vlpt due to the difference + * of page table entry layout between p2m and pt */ + second[i2].pt.ro = 0; + + if ( !second_pte.valid || !second_pte.table ) + goto out; + + third = map_domain_page(second_pte.base); + BUG_ON( !third && "Can't map third level p2m."); + + for ( i3 = third_index; i3 < LPAE_ENTRIES; ++i3 ) + { + lpae_walk_t third_pte = third[i3].walk; + int write = 0; + if ( !third_pte.valid ) + goto out; + + pte = third[i3]; + if ( pte.p2m.write == 1 && nt == mg_ro ) + { + pte.p2m.write = 0; + write = 1; + } + else if ( pte.p2m.write == 0 && nt == mg_rw ) + { + pte.p2m.write = 1; + write = 1; + } + if ( write ) + write_pte(&third[i3], pte); + } + unmap_domain_page(third); + + third = NULL; + third_index = 0; + } + unmap_domain_page(second); + + second = NULL; + second_index = 0; + third_index = 0; + } + +out: + flush_tlb_all_local(); + if ( third ) unmap_domain_page(third); + if ( second ) unmap_domain_page(second); + if ( first ) unmap_domain_page(first); + + spin_unlock(&p2m->lock); +} + +/* Read a domain's log-dirty bitmap and stats. + * If the operation is a CLEAN, clear the bitmap and stats. 
*/ +int log_dirty_op(struct domain *d, xen_domctl_shadow_op_t *sc) +{ + unsigned long gmfn_start; + unsigned long gmfn_end; + unsigned long gmfns; + unsigned int bitmap_pages; + int rc = 0, clean = 0, peek = 1; + uint8_t *bitmap[256]; /* bitmap[256] covers 32GB ram */ + int i; + + BUG_ON( !d->arch.map_domain.nr_banks ); + + gmfn_start = d->arch.map_domain.bank[0].start >> PAGE_SHIFT; + gmfn_end = domain_get_maximum_gpfn(d); + gmfns = gmfn_end - gmfn_start; + bitmap_pages = PFN_UP((gmfns + 7) / 8); + + if ( guest_handle_is_null(sc->dirty_bitmap) ) + { + peek = 0; + } + else + { + /* mapping to the bitmap from guest param */ + vaddr_t to = (vaddr_t)sc->dirty_bitmap.p; + + BUG_ON( to & ~PAGE_MASK && "offset not aligned to PAGE SIZE"); + + for ( i = 0; i < bitmap_pages; ++i ) + { + paddr_t g; + rc = gvirt_to_maddr(to, &g); + if ( rc ) + return rc; + bitmap[i] = map_domain_page(g>>PAGE_SHIFT); + memset(bitmap[i], 0x00, PAGE_SIZE); + to += PAGE_SIZE; + } + } + + clean = (sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN); + sc->stats.dirty_count = d->arch.dirty.count; + + get_dirty_bitmap(d, bitmap); + if ( peek ) + { + for ( i = 0; i < bitmap_pages; ++i ) + { + unmap_domain_page(bitmap[i]); + } + } + + return 0; +} + +long dirty_mode_op(struct domain *d, xen_domctl_shadow_op_t *sc) +{ + long ret = 0; + switch (sc->op) + { + case XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY: + case XEN_DOMCTL_SHADOW_OP_OFF: + { + enum mg nt = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? mg_rw : mg_ro; + + d->arch.dirty.mode = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? 0 : 1; + p2m_change_entry_type_global(d, nt); + + if ( sc->op == XEN_DOMCTL_SHADOW_OP_OFF ) + cleanup_vlpt(d); + else + prepare_vlpt(d); + } + break; + + case XEN_DOMCTL_SHADOW_OP_CLEAN: + case XEN_DOMCTL_SHADOW_OP_PEEK: + { + ret = log_dirty_op(d, sc); + } + break; + + default: + return -ENOSYS; + } + return ret; +} + /* * Local variables: * mode: C diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c index 4c0fc32..3b78ed2 100644 --- a/xen/arch/arm/traps.c +++ b/xen/arch/arm/traps.c @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, const char *msg; int rc, level = -1; mmio_info_t info; + int page_fault = ((dabt.dfsc & FSC_MASK) == + (FSC_FLT_PERM + FSC_3D_LEVEL) && dabt.write); if ( !check_conditional_instr(regs, hsr) ) { @@ -1327,22 +1329,23 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, info.gva = READ_SYSREG64(FAR_EL2); #endif - if (dabt.s1ptw) + if ( dabt.s1ptw && !page_fault ) goto bad_data_abort; rc = gva_to_ipa(info.gva, &info.gpa); - if ( rc == -EFAULT ) + if ( rc == -EFAULT && !page_fault ) goto bad_data_abort; /* XXX: Decode the instruction if ISS is not valid */ - if ( !dabt.valid ) + if ( !dabt.valid && !page_fault ) goto bad_data_abort; /* * Erratum 766422: Thumb store translation fault to Hypervisor may * not have correct HSR Rt value. 
*/ - if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write ) + if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write + && !page_fault) { rc = decode_instruction(regs, &info.dabt); if ( rc ) @@ -1358,6 +1361,16 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, return; } + /* handle permission fault on write */ + if ( page_fault ) + { + if ( current->domain->arch.dirty.mode == 0 ) + goto bad_data_abort; + + handle_page_fault(current->domain, info.gpa); + return; + } + bad_data_abort: msg = decode_fsc( dabt.dfsc, &level); diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h index 7e7743a..de349e9 100644 --- a/xen/include/asm-arm/domain.h +++ b/xen/include/asm-arm/domain.h @@ -114,6 +114,11 @@ struct arch_domain /* dirty-page tracing */ struct { + spinlock_t lock; + int mode; + unsigned int count; + struct page_info *head; /* maintain the mapped vaddrs */ + struct mapped_va_node *mvn_head; /* va of the head */ int second_lvl_start; /* for context switch */ int second_lvl_end; lpae_t *second_lvl; diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h index b9cbcf2..c0eade0 100644 --- a/xen/include/asm-arm/mm.h +++ b/xen/include/asm-arm/mm.h @@ -366,6 +366,10 @@ static inline lpae_t * get_vlpt_3lvl_pte(vaddr_t addr) (ipa_third << 3) ); } +/* dirty-page tracing */ +enum mg { mg_clear, mg_ro, mg_rw, mg_rx }; +void handle_page_fault(struct domain *d, paddr_t addr); + #endif /* __ARCH_ARM_MM__ */ /* * Local variables: diff --git a/xen/include/asm-arm/p2m.h b/xen/include/asm-arm/p2m.h index c660820..0bac332 100644 --- a/xen/include/asm-arm/p2m.h +++ b/xen/include/asm-arm/p2m.h @@ -2,6 +2,7 @@ #define _XEN_P2M_H #include <xen/mm.h> +#include <public/domctl.h> struct domain; @@ -110,6 +111,9 @@ static inline int get_page_and_type(struct page_info *page, return rc; } +long dirty_mode_op(struct domain *d, xen_domctl_shadow_op_t *sc); +int add_mapped_vaddr(struct domain *d, unsigned long va); + #endif /* _XEN_P2M_H */ /* diff --git a/xen/include/asm-arm/processor.h b/xen/include/asm-arm/processor.h index 5294421..fced6ad 100644 --- a/xen/include/asm-arm/processor.h +++ b/xen/include/asm-arm/processor.h @@ -399,6 +399,8 @@ union hsr { #define FSC_CPR (0x3a) /* Coprocossor Abort */ #define FSC_LL_MASK (0x03<<0) +#define FSC_MASK (0x3f) /* Fault status mask */ +#define FSC_3D_LEVEL (0x03) /* Third level fault*/ /* Time counter hypervisor control register */ #define CNTHCTL_PA (1u<<0) /* Kernel/user access to physical counter */ -- 1.8.1.2 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing 2013-10-04 4:44 ` [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing Jaeyong Yoo @ 2013-10-07 13:02 ` Julien Grall 2013-10-07 15:51 ` Eugene Fedotov 2013-10-16 13:44 ` Ian Campbell 1 sibling, 1 reply; 18+ messages in thread From: Julien Grall @ 2013-10-07 13:02 UTC (permalink / raw) To: Jaeyong Yoo; +Cc: xen-devel On 10/04/2013 05:44 AM, Jaeyong Yoo wrote: > Add hypercall (shadow op: enable/disable and clean/peek dirted page bitmap). > It consists of two parts: dirty page detecting and saving. > For detecting, we setup the guest p2m's leaf PTE read-only and whenever > the guest tries to write something, permission fault happens and traps into xen. > The permission-faulted GPA should be saved for the toolstack (when it wants to see > which pages are dirted). For this purpose, we temporarily save the GPAs into linked > list by using 'add_mapped_vaddr' function and when toolstack wants > (log_dirty_op function) the GPAs are copied into bitmap and the linnked list is flushed. > > Additionally, for supporting parallel migration of domUs, vlpt area should be context > switched. > > Signed-off-by: Jaeyong Yoo <jaeyong.yoo@samsung.com> > --- [..] > diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c > index 4c0fc32..3b78ed2 100644 > --- a/xen/arch/arm/traps.c > +++ b/xen/arch/arm/traps.c > @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, > const char *msg; > int rc, level = -1; > mmio_info_t info; > + int page_fault = ((dabt.dfsc & FSC_MASK) == > + (FSC_FLT_PERM + FSC_3D_LEVEL) && dabt.write); > > if ( !check_conditional_instr(regs, hsr) ) > { > @@ -1327,22 +1329,23 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, > info.gva = READ_SYSREG64(FAR_EL2); > #endif > > - if (dabt.s1ptw) > + if ( dabt.s1ptw && !page_fault ) I think checking !page_fault is nearly everywhere is error-prone when this function will be modified. Can you do something like this? if ( page_fault ) // Your code to handle page fault else { // handle_mmio } It will avoid && !page_fault. > goto bad_data_abort; > > rc = gva_to_ipa(info.gva, &info.gpa); > - if ( rc == -EFAULT ) > + if ( rc == -EFAULT && !page_fault ) > goto bad_data_abort; > > /* XXX: Decode the instruction if ISS is not valid */ > - if ( !dabt.valid ) > + if ( !dabt.valid && !page_fault ) > goto bad_data_abort; > > /* > * Erratum 766422: Thumb store translation fault to Hypervisor may > * not have correct HSR Rt value. > */ > - if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write ) > + if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write > + && !page_fault) > { > rc = decode_instruction(regs, &info.dabt); > if ( rc ) > @@ -1358,6 +1361,16 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, > return; > } > > + /* handle permission fault on write */ > + if ( page_fault ) > + { > + if ( current->domain->arch.dirty.mode == 0 ) > + goto bad_data_abort; > + > + handle_page_fault(current->domain, info.gpa); You must call advance_pc(regs, hsr) here. > + return; > + } > + > bad_data_abort: -- Julien Grall ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-07 13:02 ` Julien Grall
@ 2013-10-07 15:51   ` Eugene Fedotov
  2013-10-10 14:10     ` Julien Grall
  0 siblings, 1 reply; 18+ messages in thread
From: Eugene Fedotov @ 2013-10-07 15:51 UTC (permalink / raw)
To: xen-devel

07.10.2013 17:02, Julien Grall:
> I think checking !page_fault is nearly everywhere is error-prone when
> this function will be modified.
>
> Can you do something like this?
>
> if ( page_fault )
>    // Your code to handle page fault
> else
> {
>    // handle_mmio
> }
>
> It will avoid && !page_fault.

Yes, but I think that on a page-fault condition we should check whether the
address belongs to an MMIO region (a page fault at an MMIO address is
possible, but we do not need to transfer that memory).

>
>> >         goto bad_data_abort;
>> >
>> >     rc = gva_to_ipa(info.gva, &info.gpa);
>> >-    if ( rc == -EFAULT )
>> >+    if ( rc == -EFAULT && !page_fault )
>> >         goto bad_data_abort;
>> >
>> >     /* XXX: Decode the instruction if ISS is not valid */
>> >-    if ( !dabt.valid )
>> >+    if ( !dabt.valid && !page_fault )
>> >         goto bad_data_abort;
>> >
>> >     /*
>> >      * Erratum 766422: Thumb store translation fault to Hypervisor may
>> >      * not have correct HSR Rt value.
>> >      */
>> >-    if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write )
>> >+    if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write
>> >+         && !page_fault)
>> >     {
>> >         rc = decode_instruction(regs, &info.dabt);
>> >         if ( rc )
>> >@@ -1358,6 +1361,16 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>> >         return;
>> >     }
>> >
>> >+    /* handle permission fault on write */
>> >+    if ( page_fault )
>> >+    {
>> >+        if ( current->domain->arch.dirty.mode == 0 )
>> >+            goto bad_data_abort;
>> >+
>> >+        handle_page_fault(current->domain, info.gpa);
> You must call advance_pc(regs, hsr) here.
>
Let me explain. The difference between handle_page_fault and handle_mmio is
that the memory operation that trapped here should be repeated after
handle_page_fault (the page-fault handler clears the read-only bit), but
should be stepped over after handle_mmio (to do this we call advance_pc to
advance the pc register). So advance_pc is intentionally omitted.

Best regards,
Evgeny.

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-07 15:51 ` Eugene Fedotov
@ 2013-10-10 14:10   ` Julien Grall
  0 siblings, 0 replies; 18+ messages in thread
From: Julien Grall @ 2013-10-10 14:10 UTC (permalink / raw)
To: Eugene Fedotov; +Cc: xen-devel

On 10/07/2013 04:51 PM, Eugene Fedotov wrote:
> Let me explain. The difference between handle_page_fault and handle_mmio
> is that the memory operation that trapped here should be repeated after
> handle_page_fault (the page-fault handler clears the read-only bit), but
> should be stepped over after handle_mmio (to do this we call advance_pc
> to advance the pc register). So advance_pc is intentionally omitted.

Oh right, thanks for the explanation!

-- 
Julien Grall

^ permalink raw reply [flat|nested] 18+ messages in thread
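
To make the distinction above concrete, here is a condensed sketch of the two completion paths in the data-abort handler; it abbreviates do_trap_data_abort_guest() rather than quoting it literally.

/* Condensed sketch of the two completion paths discussed above
 * (abbreviated from do_trap_data_abort_guest(), not a literal quote). */

/* Log-dirty write fault: the handler makes the page writable again and we
 * return *without* advancing the PC, so the guest re-executes the faulting
 * store and it now succeeds. */
if ( page_fault && handle_page_fault(current->domain, info.gpa) )
    return;

/* Emulated MMIO access: the hypervisor performs the access on the guest's
 * behalf, so the guest must not re-execute the instruction -- skip it. */
if ( handle_mmio(&info) )
{
    advance_pc(regs, hsr);
    return;
}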
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing 2013-10-04 4:44 ` [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing Jaeyong Yoo 2013-10-07 13:02 ` Julien Grall @ 2013-10-16 13:44 ` Ian Campbell 2013-10-17 10:12 ` Jaeyong Yoo 1 sibling, 1 reply; 18+ messages in thread From: Ian Campbell @ 2013-10-16 13:44 UTC (permalink / raw) To: Jaeyong Yoo; +Cc: Stefano Stabellini, xen-devel On Fri, 2013-10-04 at 13:44 +0900, Jaeyong Yoo wrote: > Add hypercall (shadow op: enable/disable and clean/peek dirted page bitmap). "dirtied" > It consists of two parts: dirty page detecting and saving. > For detecting, we setup the guest p2m's leaf PTE read-only and whenever > the guest tries to write something, permission fault happens and traps into xen. > The permission-faulted GPA should be saved for the toolstack (when it wants to see > which pages are dirted). For this purpose, we temporarily save the GPAs into linked "dirtied" again > list by using 'add_mapped_vaddr' function and when toolstack wants > (log_dirty_op function) the GPAs are copied into bitmap and the linnked list is flushed. "linked" > spin_lock_init(&d->arch.map_lock); > d->arch.map_domain.nr_banks = 0; > > + /* init for dirty-page tracing */ > + d->arch.dirty.count = 0; > + d->arch.dirty.mode = 0; > + d->arch.dirty.head = NULL; > + d->arch.dirty.mvn_head = NULL; > + spin_lock_init(&d->arch.dirty.lock); > + > + d->arch.dirty.second_lvl_start = 0; > + d->arch.dirty.second_lvl_end = 0; > + d->arch.dirty.second_lvl = NULL; I think some/all of these belong in the previous patch which added the fields. > + > clear_page(d->shared_info); > share_xen_page_with_guest( > virt_to_page(d->shared_info), d, XENSHARE_writable); > diff --git a/xen/arch/arm/domctl.c b/xen/arch/arm/domctl.c > index 78fb322..7edce12 100644 > --- a/xen/arch/arm/domctl.c > +++ b/xen/arch/arm/domctl.c > @@ -92,6 +92,19 @@ long arch_do_domctl(struct xen_domctl *domctl, struct domain *d, > xfree(c.data); > } > break; > + case XEN_DOMCTL_shadow_op: > + { > + domain_pause(d); > + ret = dirty_mode_op(d, &domctl->u.shadow_op); > + domain_unpause(d); > + > + if ( (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_CLEAN || > + (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_PEEK ) (&domctl->u.shadow_op)->op is just a weird way to write domctl->u.shadow_op.op isn't it? Do you only want to copyback on success or is it always correct? > + { > + copyback = 1; > + } > + } > + break; > @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d) > unmap_domain_page_global(d->arch.dirty.second_lvl); > } > > +/* routine for dirty-page tracing > + * > + * On first write, it page faults, its entry is changed to read-write, > + * and on retry the write succeeds. > + * > + * for locating p2m of the faulting entry, we use virtual-linear page table. > + */ > +void handle_page_fault(struct domain *d, paddr_t addr) > +{ > + lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr); > + if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 ) I think you need to distinguish p2m entries which are r/o due to log dirty from those which are r/o due to wanting them to be r/o. Maybe we don't have the latter right now? We will eventually need p2m types I think. > + { > + lpae_t pte = *vlp2m_pte; > + pte.p2m.write = 1; > + write_pte(vlp2m_pte, pte); > + flush_tlb_local(); > + > + /* in order to remove mappings in cleanup stage */ > + add_mapped_vaddr(d, addr); No locks or atomic operations here? How are races with the tools reading the dirty bitmap addressed? 
If it is by clever ordering of the checks and pte writes then I think a comment would be in order. > + } > +} > + > /* > * Local variables: > * mode: C > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c > index 2d09fef..3b2b11a 100644 > --- a/xen/arch/arm/p2m.c > +++ b/xen/arch/arm/p2m.c > @@ -6,6 +6,8 @@ > #include <xen/bitops.h> > #include <asm/flushtlb.h> > #include <asm/gic.h> > +#include <xen/guest_access.h> > +#include <xen/pfn.h> > > void dump_p2m_lookup(struct domain *d, paddr_t addr) > { > @@ -408,6 +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn) > return p >> PAGE_SHIFT; > } > > +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - sizeof(int))\ > + / sizeof(unsigned long) > + > +/* An array-based linked list for storing virtual addresses (i.e., the 3rd > + * level table mapping) that should be destroyed after live migration */ > +struct mapped_va_node > +{ > + struct page_info *next; > + struct mapped_va_node *mvn_next; > + int items; > + unsigned long vaddrs[MAX_VA_PER_NODE]; > +}; > + > +int add_mapped_vaddr(struct domain *d, unsigned long va) > +{ This function seems awefuly expensive (contains allocations etc) for a function called on every page fault during a migration. Why are you remembering the full va of every fault when a single bit per page would do? I think you can allocate a bitmap when logdirty is enabled. At a bit per page it shouldn't be too huge. You might be able to encode this in the p2m which you walk when the tools ask for the bit map (i.e. p2m.type==writeable p2m.write==0 => dirty), but I think a bitmap would do the trick and be easier to implement. > + struct page_info *page_head; > + struct mapped_va_node *mvn_head; > + > + spin_lock(&d->arch.dirty.lock); > + > + page_head = d->arch.dirty.head; > + mvn_head = d->arch.dirty.mvn_head; > + if ( !mvn_head ) > + { > + page_head = alloc_domheap_page(NULL, 0); > + if ( !page_head ) > + { > + spin_unlock(&d->arch.dirty.lock); > + return -ENOMEM; > + } > + > + mvn_head = map_domain_page_global(__page_to_mfn(page_head)); > + mvn_head->items = 0; > + mvn_head->next = NULL; > + mvn_head->mvn_next = NULL; > + d->arch.dirty.head = page_head; > + d->arch.dirty.mvn_head = mvn_head; > + } > + > + if ( mvn_head->items == MAX_VA_PER_NODE ) > + { > + struct page_info *page; > + > + page = alloc_domheap_page(NULL, 0); > + if ( !page ) > + { > + spin_unlock(&d->arch.dirty.lock); > + return -ENOMEM; > + } > + > + mvn_head = map_domain_page_global(__page_to_mfn(page)); > + mvn_head->items = 0; > + mvn_head->next = d->arch.dirty.head; > + mvn_head->mvn_next = d->arch.dirty.mvn_head; > + > + d->arch.dirty.mvn_head = mvn_head; > + d->arch.dirty.head = page; > + } > + > + mvn_head->vaddrs[mvn_head->items] = va; > + mvn_head->items ++; > + > + spin_unlock(&d->arch.dirty.lock); > + return 0; > +} > + > +/* get the dirty bitmap from the linked list stored by add_mapped_vaddr > + * and also clear the mapped vaddrs linked list */ > +static void get_dirty_bitmap(struct domain *d, uint8_t *bitmap[]) > +{ > + struct page_info *page_head; > + struct mapped_va_node *mvn_head; > + struct page_info *tmp_page; > + struct mapped_va_node *tmp_mvn; > + vaddr_t gma_start = 0; > + vaddr_t gma_end = 0; > + > + /* this hypercall is called from domain 0, and we don't know which guest's > + * vlpt is mapped in xen_second, so, to be sure, we restore vlpt here */ > + restore_vlpt(d); AFAICT you don't actually use the vlpt here, just the domains list of dirty addresses (which as above could become a bitmap) > + > + 
spin_lock(&d->arch.dirty.lock); > + > + page_head = d->arch.dirty.head; > + mvn_head = d->arch.dirty.mvn_head; > + get_gma_start_end(d, &gma_start, &gma_end); > + > + while ( page_head ) > + { > + int i; > + > + for ( i = 0; i < mvn_head->items; ++i ) > + { > + unsigned int bit_offset; > + int bitmap_index, bitmap_offset; > + lpae_t *vlp2m_pte; > + > + bit_offset = third_linear_offset(mvn_head->vaddrs[i] - gma_start); > + bitmap_index = bit_offset >> (PAGE_SHIFT + 3); > + bitmap_offset = bit_offset & ((1ul << (PAGE_SHIFT + 3)) - 1); > + > + __set_bit(bitmap_offset, bitmap[bitmap_index]); > + > + vlp2m_pte = get_vlpt_3lvl_pte(mvn_head->vaddrs[i]); > + vlp2m_pte->p2m.write = 0; > + } > + > + tmp_page = page_head; > + page_head = mvn_head->next; > + > + tmp_mvn = mvn_head; > + mvn_head = mvn_head->mvn_next; > + > + unmap_domain_page_global(tmp_mvn); > + free_domheap_page(tmp_page); > + } > + > + d->arch.dirty.head = NULL; > + d->arch.dirty.mvn_head = NULL; > + > + spin_unlock(&d->arch.dirty.lock); > +} > + > + > +/* Change types across all p2m entries in a domain */ > +static void p2m_change_entry_type_global(struct domain *d, enum mg nt) > +{ Stefano had some patches to generalise create_p2m_entries() into a p2m walker infrastructure in his series to add XENMEM_exchange_and_pin. That series turned out to be unnecessary but it could be resurrected for use here rather than recoding a new one. > + struct p2m_domain *p2m = &d->arch.p2m; > + vaddr_t ram_base; > + int i1, i2, i3; > + int first_index, second_index, third_index; > + lpae_t *first = __map_domain_page(p2m->first_level); > + lpae_t pte, *second = NULL, *third = NULL; > + > + get_gma_start_end(d, &ram_base, NULL); > + > + first_index = first_table_offset((uint64_t)ram_base); > + second_index = second_table_offset((uint64_t)ram_base); > + third_index = third_table_offset((uint64_t)ram_base); > + > + BUG_ON( !first && "Can't map first level p2m." ); > + > + spin_lock(&p2m->lock); > + > + for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 ) > + { > + lpae_walk_t first_pte = first[i1].walk; > + if ( !first_pte.valid || !first_pte.table ) > + goto out; > + > + second = map_domain_page(first_pte.base); > + BUG_ON( !second && "Can't map second level p2m."); > + for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 ) > + { > + lpae_walk_t second_pte = second[i2].walk; > + > + /* a nasty hack for vlpt due to the difference > + * of page table entry layout between p2m and pt */ > + second[i2].pt.ro = 0; What is the hack here? The issue is that the p2m second level, which is a table entry in the p2m is installed into the Xen page tables where it ends up the third level, treated as a block entry. That's OK, because for both PT and P2M bits 2..10 are ignored for table entries. So I think rather than this hack here we should simply ensure that our non-leaf p2m's have the correct bits in them to be used as p2m table entries. 
> + > + if ( !second_pte.valid || !second_pte.table ) > + goto out; > + > + third = map_domain_page(second_pte.base); > + BUG_ON( !third && "Can't map third level p2m."); > + > + for ( i3 = third_index; i3 < LPAE_ENTRIES; ++i3 ) > + { > + lpae_walk_t third_pte = third[i3].walk; > + int write = 0; > + if ( !third_pte.valid ) > + goto out; > + > + pte = third[i3]; > + if ( pte.p2m.write == 1 && nt == mg_ro ) > + { > + pte.p2m.write = 0; > + write = 1; > + } > + else if ( pte.p2m.write == 0 && nt == mg_rw ) > + { > + pte.p2m.write = 1; > + write = 1; > + } > + if ( write ) > + write_pte(&third[i3], pte); > + } > + unmap_domain_page(third); > + > + third = NULL; > + third_index = 0; > + } > + unmap_domain_page(second); > + > + second = NULL; > + second_index = 0; > + third_index = 0; > + } > + > +out: > + flush_tlb_all_local(); > + if ( third ) unmap_domain_page(third); > + if ( second ) unmap_domain_page(second); > + if ( first ) unmap_domain_page(first); > + > + spin_unlock(&p2m->lock); > +} > + > +/* Read a domain's log-dirty bitmap and stats. > + * If the operation is a CLEAN, clear the bitmap and stats. */ > +int log_dirty_op(struct domain *d, xen_domctl_shadow_op_t *sc) Can this not be static? > +{ > + unsigned long gmfn_start; > + unsigned long gmfn_end; > + unsigned long gmfns; > + unsigned int bitmap_pages; > + int rc = 0, clean = 0, peek = 1; > + uint8_t *bitmap[256]; /* bitmap[256] covers 32GB ram */ > + int i; > + > + BUG_ON( !d->arch.map_domain.nr_banks ); > + > + gmfn_start = d->arch.map_domain.bank[0].start >> PAGE_SHIFT; > + gmfn_end = domain_get_maximum_gpfn(d); > + gmfns = gmfn_end - gmfn_start; > + bitmap_pages = PFN_UP((gmfns + 7) / 8); > + > + if ( guest_handle_is_null(sc->dirty_bitmap) ) > + { > + peek = 0; > + } > + else > + { > + /* mapping to the bitmap from guest param */ > + vaddr_t to = (vaddr_t)sc->dirty_bitmap.p; This is not allowed, please use the guest handle interfaces and copy_(to/from)_(user/guest) instead of diving into the guest handle struct and open coding it yourself. This might mean creating a temporary bitmap in hypervisor space, but if you maintain the bitmap across the whole dirty period as suggested then you should be able to simply copy it out. Actually, for consistency you might need an atomic copy and clear mechanism (otherwise you can perhaps loose updates), in which case you'll need a temporary buffer. > + > + BUG_ON( to & ~PAGE_MASK && "offset not aligned to PAGE SIZE"); > + > + for ( i = 0; i < bitmap_pages; ++i ) > + { > + paddr_t g; > + rc = gvirt_to_maddr(to, &g); > + if ( rc ) > + return rc; > + bitmap[i] = map_domain_page(g>>PAGE_SHIFT); > + memset(bitmap[i], 0x00, PAGE_SIZE); > + to += PAGE_SIZE; > + } > + } > + > + clean = (sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN); > + sc->stats.dirty_count = d->arch.dirty.count; > + > + get_dirty_bitmap(d, bitmap); > + if ( peek ) > + { > + for ( i = 0; i < bitmap_pages; ++i ) > + { > + unmap_domain_page(bitmap[i]); > + } > + } > + > + return 0; > +} > + > +long dirty_mode_op(struct domain *d, xen_domctl_shadow_op_t *sc) > +{ > + long ret = 0; > + switch (sc->op) > + { > + case XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY: > + case XEN_DOMCTL_SHADOW_OP_OFF: > + { > + enum mg nt = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? mg_rw : mg_ro; > + > + d->arch.dirty.mode = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? 
0 : 1; > + p2m_change_entry_type_global(d, nt); > + > + if ( sc->op == XEN_DOMCTL_SHADOW_OP_OFF ) > + cleanup_vlpt(d); > + else > + prepare_vlpt(d); > + } > + break; > + > + case XEN_DOMCTL_SHADOW_OP_CLEAN: > + case XEN_DOMCTL_SHADOW_OP_PEEK: > + { > + ret = log_dirty_op(d, sc); > + } > + break; > + > + default: > + return -ENOSYS; > + } > + return ret; > +} > + > /* > * Local variables: > * mode: C > diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c > index 4c0fc32..3b78ed2 100644 > --- a/xen/arch/arm/traps.c > +++ b/xen/arch/arm/traps.c > @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, > const char *msg; > int rc, level = -1; > mmio_info_t info; > + int page_fault = ((dabt.dfsc & FSC_MASK) == > + (FSC_FLT_PERM + FSC_3D_LEVEL) && dabt.write); > > if ( !check_conditional_instr(regs, hsr) ) > { > @@ -1327,22 +1329,23 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs, > info.gva = READ_SYSREG64(FAR_EL2); > #endif > > - if (dabt.s1ptw) > + if ( dabt.s1ptw && !page_fault ) > goto bad_data_abort; I see this has been commented on elsewhere. Ian. ^ permalink raw reply [flat|nested] 18+ messages in thread
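
On the guest-handle point raised in the review above, a rough sketch of the suggested copy-out style: keep the dirty bitmap in hypervisor memory and hand it back through the guest-handle accessors instead of translating sc->dirty_bitmap by hand. The d->arch.dirty.bitmap field and the helper name are assumptions for illustration only.

/* Sketch only: copy a hypervisor-side dirty bitmap out through the guest
 * handle rather than mapping the toolstack buffer manually.
 * 'd->arch.dirty.bitmap' (uint8_t *) and 'bitmap_bytes' are assumed here. */
static int copy_dirty_bitmap_to_guest(struct domain *d,
                                      xen_domctl_shadow_op_t *sc,
                                      unsigned long bitmap_bytes)
{
    if ( guest_handle_is_null(sc->dirty_bitmap) )
        return 0;                       /* caller only wanted the stats */

    /* copy_to_guest_offset() handles translation and page crossings */
    if ( copy_to_guest_offset(sc->dirty_bitmap, 0,
                              d->arch.dirty.bitmap, bitmap_bytes) )
        return -EFAULT;

    return 0;
}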
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing 2013-10-16 13:44 ` Ian Campbell @ 2013-10-17 10:12 ` Jaeyong Yoo 2013-10-17 10:30 ` Ian Campbell 0 siblings, 1 reply; 18+ messages in thread From: Jaeyong Yoo @ 2013-10-17 10:12 UTC (permalink / raw) To: 'Ian Campbell'; +Cc: 'Stefano Stabellini', xen-devel > -----Original Message----- > From: Ian Campbell [mailto:Ian.Campbell@citrix.com] > Sent: Wednesday, October 16, 2013 10:45 PM > To: Jaeyong Yoo > Cc: xen-devel@lists.xen.org; Stefano Stabellini > Subject: Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for > dirty page tracing > > On Fri, 2013-10-04 at 13:44 +0900, Jaeyong Yoo wrote: > > Add hypercall (shadow op: enable/disable and clean/peek dirted page > bitmap). > > "dirtied" > > > It consists of two parts: dirty page detecting and saving. > > For detecting, we setup the guest p2m's leaf PTE read-only and > > whenever the guest tries to write something, permission fault happens > and traps into xen. > > The permission-faulted GPA should be saved for the toolstack (when it > > wants to see which pages are dirted). For this purpose, we temporarily > > save the GPAs into linked > > "dirtied" again > > > list by using 'add_mapped_vaddr' function and when toolstack wants > > (log_dirty_op function) the GPAs are copied into bitmap and the linnked > list is flushed. > > "linked" Oops, typos. > > > spin_lock_init(&d->arch.map_lock); > > d->arch.map_domain.nr_banks = 0; > > > > + /* init for dirty-page tracing */ > > + d->arch.dirty.count = 0; > > + d->arch.dirty.mode = 0; > > + d->arch.dirty.head = NULL; > > + d->arch.dirty.mvn_head = NULL; > > + spin_lock_init(&d->arch.dirty.lock); > > + > > + d->arch.dirty.second_lvl_start = 0; > > + d->arch.dirty.second_lvl_end = 0; > > + d->arch.dirty.second_lvl = NULL; > > I think some/all of these belong in the previous patch which added the > fields. Oops, right. > > > + > > clear_page(d->shared_info); > > share_xen_page_with_guest( > > virt_to_page(d->shared_info), d, XENSHARE_writable); diff > > --git a/xen/arch/arm/domctl.c b/xen/arch/arm/domctl.c index > > 78fb322..7edce12 100644 > > --- a/xen/arch/arm/domctl.c > > +++ b/xen/arch/arm/domctl.c > > @@ -92,6 +92,19 @@ long arch_do_domctl(struct xen_domctl *domctl, struct > domain *d, > > xfree(c.data); > > } > > break; > > + case XEN_DOMCTL_shadow_op: > > + { > > + domain_pause(d); > > + ret = dirty_mode_op(d, &domctl->u.shadow_op); > > + domain_unpause(d); > > + > > + if ( (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_CLEAN || > > + (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_PEEK > > + ) > > (&domctl->u.shadow_op)->op is just a weird way to write > domctl->u.shadow_op.op isn't it? Right! > > Do you only want to copyback on success or is it always correct? Oh, I think I missed the else condition here. > > > + { > > + copyback = 1; > > + } > > + } > > + break; > > > > @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d) > > unmap_domain_page_global(d->arch.dirty.second_lvl); > > } > > > > +/* routine for dirty-page tracing > > + * > > + * On first write, it page faults, its entry is changed to > > +read-write, > > + * and on retry the write succeeds. > > + * > > + * for locating p2m of the faulting entry, we use virtual-linear page > table. 
> > + */ > > +void handle_page_fault(struct domain *d, paddr_t addr) { > > + lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr); > > + if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 ) > > I think you need to distinguish p2m entries which are r/o due to log dirty > from those which are r/o due to wanting them to be r/o. > > Maybe we don't have the latter right now? We will eventually need p2m > types I think. Yes, that should be distinguished. Actually, that was in my mind and apparently I failed to remember. For doing this, I'm thinking to use un-used field in pte as an indicator of 'wanting to be r/o' and check this indicator in permission fault. > > > + { > > + lpae_t pte = *vlp2m_pte; > > + pte.p2m.write = 1; > > + write_pte(vlp2m_pte, pte); > > + flush_tlb_local(); > > + > > + /* in order to remove mappings in cleanup stage */ > > + add_mapped_vaddr(d, addr); > > No locks or atomic operations here? How are races with the tools reading > the dirty bitmap addressed? If it is by clever ordering of the checks and > pte writes then I think a comment would be in order. Actually, the lock is inside the add_mapped_vaddr function. > > > + } > > +} > > + > > /* > > * Local variables: > > * mode: C > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c index > > 2d09fef..3b2b11a 100644 > > --- a/xen/arch/arm/p2m.c > > +++ b/xen/arch/arm/p2m.c > > @@ -6,6 +6,8 @@ > > #include <xen/bitops.h> > > #include <asm/flushtlb.h> > > #include <asm/gic.h> > > +#include <xen/guest_access.h> > > +#include <xen/pfn.h> > > > > void dump_p2m_lookup(struct domain *d, paddr_t addr) { @@ -408,6 > > +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long > gpfn) > > return p >> PAGE_SHIFT; > > } > > > > +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - sizeof(int))\ > > + / sizeof(unsigned long) > > + > > +/* An array-based linked list for storing virtual addresses (i.e., > > +the 3rd > > + * level table mapping) that should be destroyed after live migration > > +*/ struct mapped_va_node { > > + struct page_info *next; > > + struct mapped_va_node *mvn_next; > > + int items; > > + unsigned long vaddrs[MAX_VA_PER_NODE]; }; > > + > > +int add_mapped_vaddr(struct domain *d, unsigned long va) { > > This function seems awefuly expensive (contains allocations etc) for a > function called on every page fault during a migration. > > Why are you remembering the full va of every fault when a single bit per > page would do? > > I think you can allocate a bitmap when logdirty is enabled. At a bit per > page it shouldn't be too huge. > > You might be able to encode this in the p2m which you walk when the tools > ask for the bit map (i.e. p2m.type==writeable p2m.write==0 => dirty), but > I think a bitmap would do the trick and be easier to implement. There are three candidates: 1) bitmap 2) encoding into p2m 3) linked list of VAs. And, two functions for dirty-page tracing: A) dirty-page handling (At trap) B) getting dirty bitmap (for toolstack) 1) and 2) have advantage for A) but have to search the full range of pages at B) to find out which page is dirtied and set the page as read-only for next dirty detection. On the otherhand, 3) does not need to search the full range at B). I prefer 3) due to the 'non-full range search' but I understand your concern. Maybe I can do some measurement for each method to see which one better. 
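
For comparison, a minimal sketch of the bitmap approach Ian suggests (candidate 1 above): one bit per guest page, allocated when log-dirty is enabled and set from the fault handler with an atomic bit operation. The dirty.bitmap and dirty.gma_start fields are assumed names, not from the posted series.

/* Sketch of candidate 1): a per-domain dirty bitmap, one bit per gfn.
 * 'd->arch.dirty.bitmap' and 'd->arch.dirty.gma_start' are illustrative
 * field names only. */

/* At XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY time */
static int dirty_bitmap_init(struct domain *d, paddr_t ram_start, paddr_t ram_end)
{
    unsigned long pages = (ram_end - ram_start) >> PAGE_SHIFT;

    d->arch.dirty.gma_start = ram_start;
    d->arch.dirty.bitmap = xzalloc_array(unsigned long, BITS_TO_LONGS(pages));

    return d->arch.dirty.bitmap ? 0 : -ENOMEM;
}

/* In the permission-fault path, instead of add_mapped_vaddr() */
static void dirty_bitmap_mark(struct domain *d, paddr_t addr)
{
    unsigned long gfn_offset = (addr - d->arch.dirty.gma_start) >> PAGE_SHIFT;

    /* set_bit() is atomic, so no lock is needed on the fault path */
    set_bit(gfn_offset, d->arch.dirty.bitmap);
}

CLEAN/PEEK then reduces to copying this array out and clearing it, plus re-protecting only the pages whose bits are set, so the 'non-full-range search' property mentioned above is largely retained.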
> > > + struct page_info *page_head; > > + struct mapped_va_node *mvn_head; > > + > > + spin_lock(&d->arch.dirty.lock); > > + > > + page_head = d->arch.dirty.head; > > + mvn_head = d->arch.dirty.mvn_head; > > + if ( !mvn_head ) > > + { > > + page_head = alloc_domheap_page(NULL, 0); > > + if ( !page_head ) > > + { > > + spin_unlock(&d->arch.dirty.lock); > > + return -ENOMEM; > > + } > > + > > + mvn_head = map_domain_page_global(__page_to_mfn(page_head)); > > + mvn_head->items = 0; > > + mvn_head->next = NULL; > > + mvn_head->mvn_next = NULL; > > + d->arch.dirty.head = page_head; > > + d->arch.dirty.mvn_head = mvn_head; > > + } > > + > > + if ( mvn_head->items == MAX_VA_PER_NODE ) > > + { > > + struct page_info *page; > > + > > + page = alloc_domheap_page(NULL, 0); > > + if ( !page ) > > + { > > + spin_unlock(&d->arch.dirty.lock); > > + return -ENOMEM; > > + } > > + > > + mvn_head = map_domain_page_global(__page_to_mfn(page)); > > + mvn_head->items = 0; > > + mvn_head->next = d->arch.dirty.head; > > + mvn_head->mvn_next = d->arch.dirty.mvn_head; > > + > > + d->arch.dirty.mvn_head = mvn_head; > > + d->arch.dirty.head = page; > > + } > > + > > + mvn_head->vaddrs[mvn_head->items] = va; > > + mvn_head->items ++; > > + > > + spin_unlock(&d->arch.dirty.lock); > > + return 0; > > +} > > + > > +/* get the dirty bitmap from the linked list stored by > > +add_mapped_vaddr > > + * and also clear the mapped vaddrs linked list */ static void > > +get_dirty_bitmap(struct domain *d, uint8_t *bitmap[]) { > > + struct page_info *page_head; > > + struct mapped_va_node *mvn_head; > > + struct page_info *tmp_page; > > + struct mapped_va_node *tmp_mvn; > > + vaddr_t gma_start = 0; > > + vaddr_t gma_end = 0; > > + > > + /* this hypercall is called from domain 0, and we don't know which > guest's > > + * vlpt is mapped in xen_second, so, to be sure, we restore vlpt > here */ > > + restore_vlpt(d); > > AFAICT you don't actually use the vlpt here, just the domains list of > dirty addresses (which as above could become a bitmap) I think vlpt is still needed because at this stage, we have to reset the write permission of dirtied pages. Maybe some tricks for efficiently doing this? 
> > > + > > + spin_lock(&d->arch.dirty.lock); > > + > > + page_head = d->arch.dirty.head; > > + mvn_head = d->arch.dirty.mvn_head; > > + get_gma_start_end(d, &gma_start, &gma_end); > > + > > + while ( page_head ) > > + { > > + int i; > > + > > + for ( i = 0; i < mvn_head->items; ++i ) > > + { > > + unsigned int bit_offset; > > + int bitmap_index, bitmap_offset; > > + lpae_t *vlp2m_pte; > > + > > + bit_offset = third_linear_offset(mvn_head->vaddrs[i] - > gma_start); > > + bitmap_index = bit_offset >> (PAGE_SHIFT + 3); > > + bitmap_offset = bit_offset & ((1ul << (PAGE_SHIFT + 3)) - > > + 1); > > + > > + __set_bit(bitmap_offset, bitmap[bitmap_index]); > > + > > + vlp2m_pte = get_vlpt_3lvl_pte(mvn_head->vaddrs[i]); > > + vlp2m_pte->p2m.write = 0; > > + } > > + > > + tmp_page = page_head; > > + page_head = mvn_head->next; > > + > > + tmp_mvn = mvn_head; > > + mvn_head = mvn_head->mvn_next; > > + > > + unmap_domain_page_global(tmp_mvn); > > + free_domheap_page(tmp_page); > > + } > > + > > + d->arch.dirty.head = NULL; > > + d->arch.dirty.mvn_head = NULL; > > + > > + spin_unlock(&d->arch.dirty.lock); } > > + > > + > > +/* Change types across all p2m entries in a domain */ static void > > +p2m_change_entry_type_global(struct domain *d, enum mg nt) { > > Stefano had some patches to generalise create_p2m_entries() into a p2m > walker infrastructure in his series to add XENMEM_exchange_and_pin. That > series turned out to be unnecessary but it could be resurrected for use > here rather than recoding a new one. p2m walker infrastructure sounds interesting. > > > + struct p2m_domain *p2m = &d->arch.p2m; > > + vaddr_t ram_base; > > + int i1, i2, i3; > > + int first_index, second_index, third_index; > > + lpae_t *first = __map_domain_page(p2m->first_level); > > + lpae_t pte, *second = NULL, *third = NULL; > > + > > + get_gma_start_end(d, &ram_base, NULL); > > + > > + first_index = first_table_offset((uint64_t)ram_base); > > + second_index = second_table_offset((uint64_t)ram_base); > > + third_index = third_table_offset((uint64_t)ram_base); > > + > > + BUG_ON( !first && "Can't map first level p2m." ); > > + > > + spin_lock(&p2m->lock); > > + > > + for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 ) > > + { > > + lpae_walk_t first_pte = first[i1].walk; > > + if ( !first_pte.valid || !first_pte.table ) > > + goto out; > > + > > + second = map_domain_page(first_pte.base); > > + BUG_ON( !second && "Can't map second level p2m."); > > + for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 ) > > + { > > + lpae_walk_t second_pte = second[i2].walk; > > + > > + /* a nasty hack for vlpt due to the difference > > + * of page table entry layout between p2m and pt */ > > + second[i2].pt.ro = 0; > > What is the hack here? > > The issue is that the p2m second level, which is a table entry in the p2m > is installed into the Xen page tables where it ends up the third level, > treated as a block entry. > > That's OK, because for both PT and P2M bits 2..10 are ignored for table > entries. So I think rather than this hack here we should simply ensure > that our non-leaf p2m's have the correct bits in them to be used as p2m > table entries. Oh, I see. So, should I put something like this in create_p2m_entries function? 
> > > + > > + if ( !second_pte.valid || !second_pte.table ) > > + goto out; > > + > > + third = map_domain_page(second_pte.base); > > + BUG_ON( !third && "Can't map third level p2m."); > > + > > + for ( i3 = third_index; i3 < LPAE_ENTRIES; ++i3 ) > > + { > > + lpae_walk_t third_pte = third[i3].walk; > > + int write = 0; > > + if ( !third_pte.valid ) > > + goto out; > > + > > + pte = third[i3]; > > + if ( pte.p2m.write == 1 && nt == mg_ro ) > > + { > > + pte.p2m.write = 0; > > + write = 1; > > + } > > + else if ( pte.p2m.write == 0 && nt == mg_rw ) > > + { > > + pte.p2m.write = 1; > > + write = 1; > > + } > > + if ( write ) > > + write_pte(&third[i3], pte); > > + } > > + unmap_domain_page(third); > > + > > + third = NULL; > > + third_index = 0; > > + } > > + unmap_domain_page(second); > > + > > + second = NULL; > > + second_index = 0; > > + third_index = 0; > > + } > > + > > +out: > > + flush_tlb_all_local(); > > + if ( third ) unmap_domain_page(third); > > + if ( second ) unmap_domain_page(second); > > + if ( first ) unmap_domain_page(first); > > + > > + spin_unlock(&p2m->lock); > > +} > > + > > +/* Read a domain's log-dirty bitmap and stats. > > + * If the operation is a CLEAN, clear the bitmap and stats. */ int > > +log_dirty_op(struct domain *d, xen_domctl_shadow_op_t *sc) > > Can this not be static? Yes, right. > > > +{ > > + unsigned long gmfn_start; > > + unsigned long gmfn_end; > > + unsigned long gmfns; > > + unsigned int bitmap_pages; > > + int rc = 0, clean = 0, peek = 1; > > + uint8_t *bitmap[256]; /* bitmap[256] covers 32GB ram */ > > + int i; > > + > > + BUG_ON( !d->arch.map_domain.nr_banks ); > > + > > + gmfn_start = d->arch.map_domain.bank[0].start >> PAGE_SHIFT; > > + gmfn_end = domain_get_maximum_gpfn(d); > > + gmfns = gmfn_end - gmfn_start; > > + bitmap_pages = PFN_UP((gmfns + 7) / 8); > > + > > + if ( guest_handle_is_null(sc->dirty_bitmap) ) > > + { > > + peek = 0; > > + } > > + else > > + { > > + /* mapping to the bitmap from guest param */ > > + vaddr_t to = (vaddr_t)sc->dirty_bitmap.p; > > This is not allowed, please use the guest handle interfaces and > copy_(to/from)_(user/guest) instead of diving into the guest handle struct > and open coding it yourself. > > This might mean creating a temporary bitmap in hypervisor space, but if > you maintain the bitmap across the whole dirty period as suggested then > you should be able to simply copy it out. Oh, I'm seeing more advantages for using bitmap. > > Actually, for consistency you might need an atomic copy and clear > mechanism (otherwise you can perhaps loose updates), in which case you'll > need a temporary buffer. 
> > > + > > + BUG_ON( to & ~PAGE_MASK && "offset not aligned to PAGE > > + SIZE"); > > + > > + for ( i = 0; i < bitmap_pages; ++i ) > > + { > > + paddr_t g; > > + rc = gvirt_to_maddr(to, &g); > > + if ( rc ) > > + return rc; > > + bitmap[i] = map_domain_page(g>>PAGE_SHIFT); > > + memset(bitmap[i], 0x00, PAGE_SIZE); > > + to += PAGE_SIZE; > > + } > > + } > > + > > + clean = (sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN); > > + sc->stats.dirty_count = d->arch.dirty.count; > > + > > + get_dirty_bitmap(d, bitmap); > > + if ( peek ) > > + { > > + for ( i = 0; i < bitmap_pages; ++i ) > > + { > > + unmap_domain_page(bitmap[i]); > > + } > > + } > > + > > + return 0; > > +} > > + > > +long dirty_mode_op(struct domain *d, xen_domctl_shadow_op_t *sc) { > > + long ret = 0; > > + switch (sc->op) > > + { > > + case XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY: > > + case XEN_DOMCTL_SHADOW_OP_OFF: > > + { > > + enum mg nt = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? mg_rw : > > +mg_ro; > > + > > + d->arch.dirty.mode = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? 0 : > 1; > > + p2m_change_entry_type_global(d, nt); > > + > > + if ( sc->op == XEN_DOMCTL_SHADOW_OP_OFF ) > > + cleanup_vlpt(d); > > + else > > + prepare_vlpt(d); > > + } > > + break; > > + > > + case XEN_DOMCTL_SHADOW_OP_CLEAN: > > + case XEN_DOMCTL_SHADOW_OP_PEEK: > > + { > > + ret = log_dirty_op(d, sc); > > + } > > + break; > > + > > + default: > > + return -ENOSYS; > > + } > > + return ret; > > +} > > + > > /* > > * Local variables: > > * mode: C > > diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c index > > 4c0fc32..3b78ed2 100644 > > --- a/xen/arch/arm/traps.c > > +++ b/xen/arch/arm/traps.c > > @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct > cpu_user_regs *regs, > > const char *msg; > > int rc, level = -1; > > mmio_info_t info; > > + int page_fault = ((dabt.dfsc & FSC_MASK) == > > + (FSC_FLT_PERM + FSC_3D_LEVEL) && dabt.write); > > > > if ( !check_conditional_instr(regs, hsr) ) > > { > > @@ -1327,22 +1329,23 @@ static void do_trap_data_abort_guest(struct > cpu_user_regs *regs, > > info.gva = READ_SYSREG64(FAR_EL2); #endif > > > > - if (dabt.s1ptw) > > + if ( dabt.s1ptw && !page_fault ) > > goto bad_data_abort; > > I see this has been commented on elsewhere. > > Ian. ^ permalink raw reply [flat|nested] 18+ messages in thread
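A minimal sketch of the "spare PTE bits as a p2m type" idea from the exchange above, for illustration only: the avail field name, the enum values and the early return on faults that are not log-dirty related are assumptions, not part of the posted patch.

    /* Illustrative p2m types kept in the 4 software-available PTE bits,
     * so the permission-fault path can tell "read-only for log-dirty"
     * apart from "read-only by design". */
    enum illustrative_p2m_type {
        p2m_ram_rw,         /* normal writable RAM */
        p2m_ram_logdirty,   /* write-protected only for dirty tracking */
        p2m_ram_ro,         /* meant to stay read-only for the guest */
    };

    void handle_page_fault(struct domain *d, paddr_t addr)
    {
        lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
        lpae_t pte = *vlp2m_pte;

        /* Only pages write-protected for log-dirty are made writable again;
         * a real handler would forward any other permission fault to the
         * guest instead of silently returning. */
        if ( !pte.p2m.valid || pte.p2m.write ||
             pte.p2m.avail != p2m_ram_logdirty )
            return;

        pte.p2m.write = 1;
        pte.p2m.avail = p2m_ram_rw;    /* the next CLEAN pass re-tags it */
        write_pte(vlp2m_pte, pte);
        flush_tlb_local();

        /* record the dirty page for the toolstack, as in the patch */
        add_mapped_vaddr(d, addr);
    }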
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing 2013-10-17 10:12 ` Jaeyong Yoo @ 2013-10-17 10:30 ` Ian Campbell 2013-10-17 11:05 ` Jaeyong Yoo 0 siblings, 1 reply; 18+ messages in thread From: Ian Campbell @ 2013-10-17 10:30 UTC (permalink / raw) To: Jaeyong Yoo; +Cc: 'Stefano Stabellini', xen-devel On Thu, 2013-10-17 at 19:12 +0900, Jaeyong Yoo wrote: > > > @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d) > > > unmap_domain_page_global(d->arch.dirty.second_lvl); > > > } > > > > > > +/* routine for dirty-page tracing > > > + * > > > + * On first write, it page faults, its entry is changed to > > > +read-write, > > > + * and on retry the write succeeds. > > > + * > > > + * for locating p2m of the faulting entry, we use virtual-linear page > > table. > > > + */ > > > +void handle_page_fault(struct domain *d, paddr_t addr) { > > > + lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr); > > > + if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 ) > > > > I think you need to distinguish p2m entries which are r/o due to log dirty > > from those which are r/o due to wanting them to be r/o. > > > > Maybe we don't have the latter right now? We will eventually need p2m > > types I think. > > Yes, that should be distinguished. Actually, that was in my mind and apparently > I failed to remember. For doing this, I'm thinking to use un-used field in pte > as an indicator of 'wanting to be r/o' and check this indicator in permission fault. Yes, I think we can use the 4 avail bits to store a p2m type. ISTR discussing this before, but perhaps it wasn't with you! > > > > > > + { > > > + lpae_t pte = *vlp2m_pte; > > > + pte.p2m.write = 1; > > > + write_pte(vlp2m_pte, pte); > > > + flush_tlb_local(); > > > + > > > + /* in order to remove mappings in cleanup stage */ > > > + add_mapped_vaddr(d, addr); > > > > No locks or atomic operations here? How are races with the tools reading > > the dirty bitmap addressed? If it is by clever ordering of the checks and > > pte writes then I think a comment would be in order. > > Actually, the lock is inside the add_mapped_vaddr function. But that only covers the bitmap update, not the pte frobbing. > > > > > > + } > > > +} > > > + > > > /* > > > * Local variables: > > > * mode: C > > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c index > > > 2d09fef..3b2b11a 100644 > > > --- a/xen/arch/arm/p2m.c > > > +++ b/xen/arch/arm/p2m.c > > > @@ -6,6 +6,8 @@ > > > #include <xen/bitops.h> > > > #include <asm/flushtlb.h> > > > #include <asm/gic.h> > > > +#include <xen/guest_access.h> > > > +#include <xen/pfn.h> > > > > > > void dump_p2m_lookup(struct domain *d, paddr_t addr) { @@ -408,6 > > > +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long > > gpfn) > > > return p >> PAGE_SHIFT; > > > } > > > > > > +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - sizeof(int))\ > > > + / sizeof(unsigned long) > > > + > > > +/* An array-based linked list for storing virtual addresses (i.e., > > > +the 3rd > > > + * level table mapping) that should be destroyed after live migration > > > +*/ struct mapped_va_node { > > > + struct page_info *next; > > > + struct mapped_va_node *mvn_next; > > > + int items; > > > + unsigned long vaddrs[MAX_VA_PER_NODE]; }; > > > + > > > +int add_mapped_vaddr(struct domain *d, unsigned long va) { > > > > This function seems awefuly expensive (contains allocations etc) for a > > function called on every page fault during a migration. 
> > > > Why are you remembering the full va of every fault when a single bit per > > page would do? > > > > I think you can allocate a bitmap when logdirty is enabled. At a bit per > > page it shouldn't be too huge. > > > > You might be able to encode this in the p2m which you walk when the tools > > ask for the bit map (i.e. p2m.type==writeable p2m.write==0 => dirty), but > > I think a bitmap would do the trick and be easier to implement. > > There are three candidates: > > 1) bitmap > 2) encoding into p2m > 3) linked list of VAs. > > And, two functions for dirty-page tracing: > > A) dirty-page handling (At trap) > B) getting dirty bitmap (for toolstack) > > 1) and 2) have advantage for A) but have to search the full range of pages at B) > to find out which page is dirtied and set the page as read-only for next dirty > detection. On the otherhand, 3) does not need to search the full range at B). > I prefer 3) due to the 'non-full range search' but I understand your concern. > Maybe I can do some measurement for each method to see which one better. If you have a bitmap then you can do a scan for each set bit, and the offset of each bit corresponds to the guest pfn, which allows you to directly calculate the address of the pte in the vlpt. I don't think doing a scan for each set bit is going to compare unfavourably to walking a linked list. Certainly not once the memory allocator gets involved. > > > +/* get the dirty bitmap from the linked list stored by > > > +add_mapped_vaddr > > > + * and also clear the mapped vaddrs linked list */ static void > > > +get_dirty_bitmap(struct domain *d, uint8_t *bitmap[]) { > > > + struct page_info *page_head; > > > + struct mapped_va_node *mvn_head; > > > + struct page_info *tmp_page; > > > + struct mapped_va_node *tmp_mvn; > > > + vaddr_t gma_start = 0; > > > + vaddr_t gma_end = 0; > > > + > > > + /* this hypercall is called from domain 0, and we don't know which > > guest's > > > + * vlpt is mapped in xen_second, so, to be sure, we restore vlpt > > here */ > > > + restore_vlpt(d); > > > > AFAICT you don't actually use the vlpt here, just the domains list of > > dirty addresses (which as above could become a bitmap) > > I think vlpt is still needed because at this stage, we have to reset the > write permission of dirtied pages. Maybe some tricks for efficiently doing this? Ah, of course. Do you need to be careful about the context switch of a dom0 vcpu which has a foreign vlpt loaded? I suppose you unload it at the end of this function so it is safe because Xen doesn't preempt vcpus in hypervisor mode. > > > > > + struct p2m_domain *p2m = &d->arch.p2m; > > > + vaddr_t ram_base; > > > + int i1, i2, i3; > > > + int first_index, second_index, third_index; > > > + lpae_t *first = __map_domain_page(p2m->first_level); > > > + lpae_t pte, *second = NULL, *third = NULL; > > > + > > > + get_gma_start_end(d, &ram_base, NULL); > > > + > > > + first_index = first_table_offset((uint64_t)ram_base); > > > + second_index = second_table_offset((uint64_t)ram_base); > > > + third_index = third_table_offset((uint64_t)ram_base); > > > + > > > + BUG_ON( !first && "Can't map first level p2m." 
); > > > + > > > + spin_lock(&p2m->lock); > > > + > > > + for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 ) > > > + { > > > + lpae_walk_t first_pte = first[i1].walk; > > > + if ( !first_pte.valid || !first_pte.table ) > > > + goto out; > > > + > > > + second = map_domain_page(first_pte.base); > > > + BUG_ON( !second && "Can't map second level p2m."); > > > + for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 ) > > > + { > > > + lpae_walk_t second_pte = second[i2].walk; > > > + > > > + /* a nasty hack for vlpt due to the difference > > > + * of page table entry layout between p2m and pt */ > > > + second[i2].pt.ro = 0; > > > > What is the hack here? > > > > The issue is that the p2m second level, which is a table entry in the p2m > > is installed into the Xen page tables where it ends up the third level, > > treated as a block entry. > > > > That's OK, because for both PT and P2M bits 2..10 are ignored for table > > entries. So I think rather than this hack here we should simply ensure > > that our non-leaf p2m's have the correct bits in them to be used as p2m > > table entries. > > Oh, I see. So, should I put something like this in create_p2m_entries function? Probably p2m_create_table or maybe mfn_to_p2m entry would be the right place. Ian. ^ permalink raw reply [flat|nested] 18+ messages in thread
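For reference, the "scan for each set bit" approach could look roughly like the sketch below. The dirty.bitmap and dirty.nr_pfns fields are hypothetical (they would be allocated when log-dirty is enabled); the remaining helpers are the ones already used in the patch.

    /* Sketch: walk only the set bits of a per-domain dirty bitmap.
     * Bit i corresponds to guest pfn (gma_start >> PAGE_SHIFT) + i. */
    static void sync_dirty_bitmap(struct domain *d, uint8_t *tools_bitmap[])
    {
        unsigned long i = 0, nr = d->arch.dirty.nr_pfns;   /* hypothetical */
        vaddr_t gma_start = 0;

        get_gma_start_end(d, &gma_start, NULL);

        spin_lock(&d->arch.dirty.lock);

        while ( (i = find_next_bit(d->arch.dirty.bitmap, nr, i)) < nr )
        {
            paddr_t addr = gma_start + ((paddr_t)i << PAGE_SHIFT);
            lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
            lpae_t pte = *vlp2m_pte;

            /* re-arm write protection for the next round */
            pte.p2m.write = 0;
            write_pte(vlp2m_pte, pte);

            /* hand the bit to the toolstack copy and clear our own */
            __set_bit(i & ((1ul << (PAGE_SHIFT + 3)) - 1),
                      tools_bitmap[i >> (PAGE_SHIFT + 3)]);
            clear_bit(i, d->arch.dirty.bitmap);
            i++;
        }

        spin_unlock(&d->arch.dirty.lock);
        flush_tlb_all_local();    /* one flush covers the whole batch */
    }

The fault handler side then reduces to an atomic set_bit() on the same bitmap, which also sidesteps the allocation-under-fault concern raised above.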
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing 2013-10-17 10:30 ` Ian Campbell @ 2013-10-17 11:05 ` Jaeyong Yoo 2013-10-17 11:47 ` Ian Campbell 0 siblings, 1 reply; 18+ messages in thread From: Jaeyong Yoo @ 2013-10-17 11:05 UTC (permalink / raw) To: 'Ian Campbell'; +Cc: 'Stefano Stabellini', xen-devel > -----Original Message----- > From: xen-devel-bounces@lists.xen.org [mailto:xen-devel- > bounces@lists.xen.org] On Behalf Of Ian Campbell > Sent: Thursday, October 17, 2013 7:30 PM > To: Jaeyong Yoo > Cc: 'Stefano Stabellini'; xen-devel@lists.xen.org > Subject: Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for > dirty page tracing > > On Thu, 2013-10-17 at 19:12 +0900, Jaeyong Yoo wrote: > > > > > @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d) > > > > unmap_domain_page_global(d->arch.dirty.second_lvl); > > > > } > > > > > > > > +/* routine for dirty-page tracing > > > > + * > > > > + * On first write, it page faults, its entry is changed to > > > > +read-write, > > > > + * and on retry the write succeeds. > > > > + * > > > > + * for locating p2m of the faulting entry, we use virtual-linear > > > > +page > > > table. > > > > + */ > > > > +void handle_page_fault(struct domain *d, paddr_t addr) { > > > > + lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr); > > > > + if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 ) > > > > > > I think you need to distinguish p2m entries which are r/o due to log > > > dirty from those which are r/o due to wanting them to be r/o. > > > > > > Maybe we don't have the latter right now? We will eventually need > > > p2m types I think. > > > > Yes, that should be distinguished. Actually, that was in my mind and > > apparently I failed to remember. For doing this, I'm thinking to use > > un-used field in pte as an indicator of 'wanting to be r/o' and check > this indicator in permission fault. > > Yes, I think we can use the 4 avail bits to store a p2m type. ISTR > discussing this before, but perhaps it wasn't with you! > > > > > > > > > > + { > > > > + lpae_t pte = *vlp2m_pte; > > > > + pte.p2m.write = 1; > > > > + write_pte(vlp2m_pte, pte); > > > > + flush_tlb_local(); > > > > + > > > > + /* in order to remove mappings in cleanup stage */ > > > > + add_mapped_vaddr(d, addr); > > > > > > No locks or atomic operations here? How are races with the tools > > > reading the dirty bitmap addressed? If it is by clever ordering of > > > the checks and pte writes then I think a comment would be in order. > > > > Actually, the lock is inside the add_mapped_vaddr function. > > But that only covers the bitmap update, not the pte frobbing. I also locked with the same lock at the get_dirty_bitmap which is reading the bitmap. 
> > > > > > > > > > + } > > > > +} > > > > + > > > > /* > > > > * Local variables: > > > > * mode: C > > > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c index > > > > 2d09fef..3b2b11a 100644 > > > > --- a/xen/arch/arm/p2m.c > > > > +++ b/xen/arch/arm/p2m.c > > > > @@ -6,6 +6,8 @@ > > > > #include <xen/bitops.h> > > > > #include <asm/flushtlb.h> > > > > #include <asm/gic.h> > > > > +#include <xen/guest_access.h> > > > > +#include <xen/pfn.h> > > > > > > > > void dump_p2m_lookup(struct domain *d, paddr_t addr) { @@ -408,6 > > > > +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned > > > > +long > > > gpfn) > > > > return p >> PAGE_SHIFT; > > > > } > > > > > > > > +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - > sizeof(int))\ > > > > + / sizeof(unsigned long) > > > > + > > > > +/* An array-based linked list for storing virtual addresses > > > > +(i.e., the 3rd > > > > + * level table mapping) that should be destroyed after live > > > > +migration */ struct mapped_va_node { > > > > + struct page_info *next; > > > > + struct mapped_va_node *mvn_next; > > > > + int items; > > > > + unsigned long vaddrs[MAX_VA_PER_NODE]; }; > > > > + > > > > +int add_mapped_vaddr(struct domain *d, unsigned long va) { > > > > > > This function seems awefuly expensive (contains allocations etc) for > > > a function called on every page fault during a migration. > > > > > > Why are you remembering the full va of every fault when a single bit > > > per page would do? > > > > > > I think you can allocate a bitmap when logdirty is enabled. At a bit > > > per page it shouldn't be too huge. > > > > > > You might be able to encode this in the p2m which you walk when the > > > tools ask for the bit map (i.e. p2m.type==writeable p2m.write==0 => > > > dirty), but I think a bitmap would do the trick and be easier to > implement. > > > > There are three candidates: > > > > 1) bitmap > > 2) encoding into p2m > > 3) linked list of VAs. > > > > And, two functions for dirty-page tracing: > > > > A) dirty-page handling (At trap) > > B) getting dirty bitmap (for toolstack) > > > > 1) and 2) have advantage for A) but have to search the full range of > > pages at B) to find out which page is dirtied and set the page as > > read-only for next dirty detection. On the otherhand, 3) does not need > to search the full range at B). > > I prefer 3) due to the 'non-full range search' but I understand your > concern. > > Maybe I can do some measurement for each method to see which one better. > > If you have a bitmap then you can do a scan for each set bit, and the > offset of each bit corresponds to the guest pfn, which allows you to > directly calculate the address of the pte in the vlpt. > > I don't think doing a scan for each set bit is going to compare > unfavourably to walking a linked list. Certainly not once the memory > allocator gets involved. Oh, do you mean using something like "find_next_zero_bit"? At the first time, I thought searching every bit-by-bit but, I just see findbit.S and it looks quite optimized. 
> > > > > +/* get the dirty bitmap from the linked list stored by > > > > +add_mapped_vaddr > > > > + * and also clear the mapped vaddrs linked list */ static void > > > > +get_dirty_bitmap(struct domain *d, uint8_t *bitmap[]) { > > > > + struct page_info *page_head; > > > > + struct mapped_va_node *mvn_head; > > > > + struct page_info *tmp_page; > > > > + struct mapped_va_node *tmp_mvn; > > > > + vaddr_t gma_start = 0; > > > > + vaddr_t gma_end = 0; > > > > + > > > > + /* this hypercall is called from domain 0, and we don't know > > > > + which > > > guest's > > > > + * vlpt is mapped in xen_second, so, to be sure, we restore > > > > + vlpt > > > here */ > > > > + restore_vlpt(d); > > > > > > AFAICT you don't actually use the vlpt here, just the domains list > > > of dirty addresses (which as above could become a bitmap) > > > > I think vlpt is still needed because at this stage, we have to reset > > the write permission of dirtied pages. Maybe some tricks for efficiently > doing this? > > Ah, of course. > > Do you need to be careful about the context switch of a dom0 vcpu which > has a foreign vlpt loaded? I suppose you unload it at the end of this > function so it is safe because Xen doesn't preempt vcpus in hypervisor > mode. That is my concern now. For safety, unloading is necessary but it requires flushing vlpt which is not a cheap operation. Actually, there is one more concern: Since we only context switch the vlpt area with 'migrating domains' when two domains (one migrating and the other not) are context switched, the non-migrating domain have access to the foreign vlpt. In order to prevent the access of foreign vlpt area, do you think unloading of vlpt when switching into non-migrating domains also necessary? > > > > > > > > + struct p2m_domain *p2m = &d->arch.p2m; > > > > + vaddr_t ram_base; > > > > + int i1, i2, i3; > > > > + int first_index, second_index, third_index; > > > > + lpae_t *first = __map_domain_page(p2m->first_level); > > > > + lpae_t pte, *second = NULL, *third = NULL; > > > > + > > > > + get_gma_start_end(d, &ram_base, NULL); > > > > + > > > > + first_index = first_table_offset((uint64_t)ram_base); > > > > + second_index = second_table_offset((uint64_t)ram_base); > > > > + third_index = third_table_offset((uint64_t)ram_base); > > > > + > > > > + BUG_ON( !first && "Can't map first level p2m." ); > > > > + > > > > + spin_lock(&p2m->lock); > > > > + > > > > + for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 ) > > > > + { > > > > + lpae_walk_t first_pte = first[i1].walk; > > > > + if ( !first_pte.valid || !first_pte.table ) > > > > + goto out; > > > > + > > > > + second = map_domain_page(first_pte.base); > > > > + BUG_ON( !second && "Can't map second level p2m."); > > > > + for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 ) > > > > + { > > > > + lpae_walk_t second_pte = second[i2].walk; > > > > + > > > > + /* a nasty hack for vlpt due to the difference > > > > + * of page table entry layout between p2m and pt */ > > > > + second[i2].pt.ro = 0; > > > > > > What is the hack here? > > > > > > The issue is that the p2m second level, which is a table entry in > > > the p2m is installed into the Xen page tables where it ends up the > > > third level, treated as a block entry. > > > > > > That's OK, because for both PT and P2M bits 2..10 are ignored for > > > table entries. So I think rather than this hack here we should > > > simply ensure that our non-leaf p2m's have the correct bits in them > > > to be used as p2m table entries. > > > > Oh, I see. 
So, should I put something like this in create_p2m_entries > function? > > Probably p2m_create_table or maybe mfn_to_p2m entry would be the right > place. OK. > > Ian. > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 18+ messages in thread
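The "set the bits at creation time" alternative to the second[i2].pt.ro hack could be as small as the helper below; the function name is made up, and where exactly it is called (p2m_create_table() or mfn_to_p2m_entry()) is the open question from the discussion.

    /* Sketch: normalise a freshly built non-leaf p2m entry so that, when the
     * same bits are later read back through the Xen page-table (lpae_pt_t)
     * layout by the VLPT code, they already look like a writable table
     * entry.  The hardware walker ignores bits 2..10 of a table entry, so
     * this only affects the software view and would let
     * p2m_change_entry_type_global() drop its pt.ro fix-up. */
    static inline lpae_t p2m_table_entry_fixup(lpae_t pte)
    {
        pte.pt.ro = 0;
        return pte;
    }

    /* intended use, e.g. in p2m_create_table(), just before the entry is
     * installed:
     *
     *     pte = p2m_table_entry_fixup(pte);
     *     write_pte(entry, pte);
     */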
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing 2013-10-17 11:05 ` Jaeyong Yoo @ 2013-10-17 11:47 ` Ian Campbell 2013-10-18 4:17 ` Jaeyong Yoo 0 siblings, 1 reply; 18+ messages in thread From: Ian Campbell @ 2013-10-17 11:47 UTC (permalink / raw) To: Jaeyong Yoo; +Cc: 'Stefano Stabellini', xen-devel On Thu, 2013-10-17 at 20:05 +0900, Jaeyong Yoo wrote: > > > > > > > > > > > > + { > > > > > + lpae_t pte = *vlp2m_pte; > > > > > + pte.p2m.write = 1; > > > > > + write_pte(vlp2m_pte, pte); > > > > > + flush_tlb_local(); > > > > > + > > > > > + /* in order to remove mappings in cleanup stage */ > > > > > + add_mapped_vaddr(d, addr); > > > > > > > > No locks or atomic operations here? How are races with the tools > > > > reading the dirty bitmap addressed? If it is by clever ordering of > > > > the checks and pte writes then I think a comment would be in order. > > > > > > Actually, the lock is inside the add_mapped_vaddr function. > > > > But that only covers the bitmap update, not the pte frobbing. > > I also locked with the same lock at the get_dirty_bitmap which is reading > the bitmap. I meant the pte frobbing immediately before the call the add_mapped_vaddr quoted above. > > > There are three candidates: > > > > > > 1) bitmap > > > 2) encoding into p2m > > > 3) linked list of VAs. > > > > > > And, two functions for dirty-page tracing: > > > > > > A) dirty-page handling (At trap) > > > B) getting dirty bitmap (for toolstack) > > > > > > 1) and 2) have advantage for A) but have to search the full range of > > > pages at B) to find out which page is dirtied and set the page as > > > read-only for next dirty detection. On the otherhand, 3) does not need > > to search the full range at B). > > > I prefer 3) due to the 'non-full range search' but I understand your > > concern. > > > Maybe I can do some measurement for each method to see which one better. > > > > If you have a bitmap then you can do a scan for each set bit, and the > > offset of each bit corresponds to the guest pfn, which allows you to > > directly calculate the address of the pte in the vlpt. > > > > I don't think doing a scan for each set bit is going to compare > > unfavourably to walking a linked list. Certainly not once the memory > > allocator gets involved. > > Oh, do you mean using something like "find_next_zero_bit"? find_next_bit probably since you want to find the 1s. I suppose you could also initialise to all 1 and clear bits instead > At the first > time, > I thought searching every bit-by-bit but, I just see findbit.S and it looks > quite optimized. Yes, I think it should be fast enough. 
> > > > > +/* get the dirty bitmap from the linked list stored by > > > > > +add_mapped_vaddr > > > > > + * and also clear the mapped vaddrs linked list */ static void > > > > > +get_dirty_bitmap(struct domain *d, uint8_t *bitmap[]) { > > > > > + struct page_info *page_head; > > > > > + struct mapped_va_node *mvn_head; > > > > > + struct page_info *tmp_page; > > > > > + struct mapped_va_node *tmp_mvn; > > > > > + vaddr_t gma_start = 0; > > > > > + vaddr_t gma_end = 0; > > > > > + > > > > > + /* this hypercall is called from domain 0, and we don't know > > > > > + which > > > > guest's > > > > > + * vlpt is mapped in xen_second, so, to be sure, we restore > > > > > + vlpt > > > > here */ > > > > > + restore_vlpt(d); > > > > > > > > AFAICT you don't actually use the vlpt here, just the domains list > > > > of dirty addresses (which as above could become a bitmap) > > > > > > I think vlpt is still needed because at this stage, we have to reset > > > the write permission of dirtied pages. Maybe some tricks for efficiently > > doing this? > > > > Ah, of course. > > > > Do you need to be careful about the context switch of a dom0 vcpu which > > has a foreign vlpt loaded? I suppose you unload it at the end of this > > function so it is safe because Xen doesn't preempt vcpus in hypervisor > > mode. > > > That is my concern now. For safety, unloading is necessary but it requires > flushing vlpt which is not a cheap operation. I suppose it could be deferred either to the next load of a vplt, if it is a different one, or the next context switch away from the vcpu. You'd just need to track which foreign vlpt was currently loaded on the vcpu in vcpu->arch.something? > Actually, there is one more concern: > Since we only context switch the vlpt area with 'migrating domains' when two > domains (one migrating and the other not) are context switched, the > non-migrating > domain have access to the foreign vlpt. In order to prevent the access of > foreign vlpt area, do you think unloading of vlpt when switching into > non-migrating > domains also necessary? The domain itself doesn't have access to the vlpt since it is mapped in hypervisor context so the only danger is that Xen accidentally uses the vlpt for something while the non-migrating domain is scheduled. I think we should just not write that bug ;-) IOW I think flushing the vlpt lazily is OK. Ian. ^ permalink raw reply [flat|nested] 18+ messages in thread
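One possible shape for the lazy scheme, with the per-vcpu tracking field and the exact Xen-TLB flush primitive both to be read as placeholders:

    /* Sketch of lazy VLPT loading/unloading.  v->arch.vlpt_loaded is a
     * hypothetical field recording whose p2m is currently mapped through
     * the VLPT area; flush_xen_data_tlb() stands in for whatever flush of
     * the VLPT virtual range turns out to be appropriate. */
    void vlpt_load(struct vcpu *v, struct domain *d)
    {
        if ( v->arch.vlpt_loaded == d )
            return;                     /* same foreign vlpt: nothing to do */

        restore_vlpt(d);                /* map d's p2m into xen_second */
        flush_xen_data_tlb();           /* drop any stale VLPT translations */
        v->arch.vlpt_loaded = d;
    }

    void vlpt_unload(struct vcpu *v)    /* e.g. on context switch away */
    {
        if ( v->arch.vlpt_loaded == NULL )
            return;

        flush_xen_data_tlb();
        v->arch.vlpt_loaded = NULL;
    }

With this, the flush only happens when a different domain's vlpt is about to be used or when the vcpu is switched away, rather than on every operation.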
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing 2013-10-17 11:47 ` Ian Campbell @ 2013-10-18 4:17 ` Jaeyong Yoo 2013-10-18 9:11 ` Ian Campbell 0 siblings, 1 reply; 18+ messages in thread From: Jaeyong Yoo @ 2013-10-18 4:17 UTC (permalink / raw) To: 'Ian Campbell'; +Cc: 'Stefano Stabellini', xen-devel > -----Original Message----- > From: xen-devel-bounces@lists.xen.org [mailto:xen-devel- > bounces@lists.xen.org] On Behalf Of Ian Campbell > Sent: Thursday, October 17, 2013 8:47 PM > To: Jaeyong Yoo > Cc: 'Stefano Stabellini'; xen-devel@lists.xen.org > Subject: Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for > dirty page tracing > > On Thu, 2013-10-17 at 20:05 +0900, Jaeyong Yoo wrote: > > > > > > > > > > > > > > > + { > > > > > > + lpae_t pte = *vlp2m_pte; > > > > > > + pte.p2m.write = 1; > > > > > > + write_pte(vlp2m_pte, pte); > > > > > > + flush_tlb_local(); > > > > > > + > > > > > > + /* in order to remove mappings in cleanup stage */ > > > > > > + add_mapped_vaddr(d, addr); > > > > > > > > > > No locks or atomic operations here? How are races with the tools > > > > > reading the dirty bitmap addressed? If it is by clever ordering > > > > > of the checks and pte writes then I think a comment would be in > order. > > > > > > > > Actually, the lock is inside the add_mapped_vaddr function. > > > > > > But that only covers the bitmap update, not the pte frobbing. > > > > I also locked with the same lock at the get_dirty_bitmap which is > > reading the bitmap. > > I meant the pte frobbing immediately before the call the add_mapped_vaddr > quoted above. > I don't think it is necessary because at the moment the dirty bitmap is constructed based on the VAs in the linked list. If the PTE frobbing happens immediately before add_mapped_vaddr, the corresponding dirty-page would be checked at the next round get-dirty-gitmap. Now, I start thinking of using bitmap rather than linked list, and I think it is similar there. The critical section would be the one that writes into the bitmap and the one that reads the bitmap (copy to the bitmap from toolstack). Jaeyong ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing 2013-10-18 4:17 ` Jaeyong Yoo @ 2013-10-18 9:11 ` Ian Campbell 0 siblings, 0 replies; 18+ messages in thread From: Ian Campbell @ 2013-10-18 9:11 UTC (permalink / raw) To: Jaeyong Yoo; +Cc: 'Stefano Stabellini', xen-devel On Fri, 2013-10-18 at 13:17 +0900, Jaeyong Yoo wrote: > > I meant the pte frobbing immediately before the call the add_mapped_vaddr > > quoted above. > > > > I don't think it is necessary because at the moment the dirty bitmap > is constructed based on the VAs in the linked list. If the PTE frobbing > happens > immediately before add_mapped_vaddr, the corresponding dirty-page would be > checked at the next round get-dirty-gitmap. OK, thanks. Can we get a code comment on the ordering constraints here please. > > > Now, I start thinking of using bitmap rather than linked list, and I think > it > is similar there. The critical section would be the one that writes into the > bitmap and the one that reads the bitmap (copy to the bitmap from > toolstack). > > Jaeyong > ^ permalink raw reply [flat|nested] 18+ messages in thread
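The code comment asked for here could capture the constraint along these lines (wording is only a suggestion, reusing the fault-path code from the patch):

    /* Ordering note: the PTE is made writable and the TLB flushed *before*
     * the address is recorded via add_mapped_vaddr().  If the toolstack's
     * CLEAN operation runs in between, this page is simply reported in the
     * next round of XEN_DOMCTL_SHADOW_OP_CLEAN rather than the current one,
     * so no dirty page is lost; it is at worst reported one round late.
     * add_mapped_vaddr() takes d->arch.dirty.lock internally, which
     * serialises it against the bitmap construction in log_dirty_op(). */
    pte.p2m.write = 1;
    write_pte(vlp2m_pte, pte);
    flush_tlb_local();
    add_mapped_vaddr(d, addr);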