xen-devel.lists.xenproject.org archive mirror
* [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-04  4:43 [PATCH v4 0/9] xen/arm: live migration support in arndale board Jaeyong Yoo
@ 2013-10-04  4:44 ` Jaeyong Yoo
  2013-10-07 13:02   ` Julien Grall
  2013-10-16 13:44   ` Ian Campbell
  0 siblings, 2 replies; 18+ messages in thread
From: Jaeyong Yoo @ 2013-10-04  4:44 UTC (permalink / raw)
  To: xen-devel; +Cc: Jaeyong Yoo

Add hypercall (shadow op: enable/disable and clean/peek dirtied page bitmap).
It consists of two parts: dirty page detection and saving.
For detection, we set up the guest p2m's leaf PTEs as read-only, and whenever
the guest tries to write something, a permission fault happens and traps into xen.
The permission-faulted GPA should be saved for the toolstack (when it wants to see
which pages are dirtied). For this purpose, we temporarily save the GPAs into a linked
list by using the 'add_mapped_vaddr' function, and when the toolstack asks for them
(the log_dirty_op function) the GPAs are copied into a bitmap and the linked list is flushed.

Additionally, to support parallel migration of multiple domUs, the vlpt area has to
be context switched.

Signed-off-by: Jaeyong Yoo <jaeyong.yoo@samsung.com>
---
 xen/arch/arm/domain.c           |  15 ++
 xen/arch/arm/domctl.c           |  13 ++
 xen/arch/arm/mm.c               |  23 ++-
 xen/arch/arm/p2m.c              | 300 ++++++++++++++++++++++++++++++++++++++++
 xen/arch/arm/traps.c            |  21 ++-
 xen/include/asm-arm/domain.h    |   5 +
 xen/include/asm-arm/mm.h        |   4 +
 xen/include/asm-arm/p2m.h       |   4 +
 xen/include/asm-arm/processor.h |   2 +
 9 files changed, 382 insertions(+), 5 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 57be341..40ebeed 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -215,6 +215,10 @@ static void ctxt_switch_to(struct vcpu *n)
     WRITE_SYSREG(hcr, HCR_EL2);
     isb();
 
+    /* vlpt area context switching for dirty-page tracing */
+    if ( n->domain->arch.dirty.mode )
+        restore_vlpt(n->domain);
+
     /* This is could trigger an hardware interrupt from the virtual
      * timer. The interrupt needs to be injected into the guest. */
     virt_timer_restore(n);
@@ -512,6 +516,17 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
     spin_lock_init(&d->arch.map_lock);
     d->arch.map_domain.nr_banks = 0;
 
+    /* init for dirty-page tracing */
+    d->arch.dirty.count = 0;
+    d->arch.dirty.mode = 0;
+    d->arch.dirty.head = NULL;
+    d->arch.dirty.mvn_head = NULL;
+    spin_lock_init(&d->arch.dirty.lock);
+
+    d->arch.dirty.second_lvl_start = 0;
+    d->arch.dirty.second_lvl_end = 0;
+    d->arch.dirty.second_lvl = NULL;
+
     clear_page(d->shared_info);
     share_xen_page_with_guest(
         virt_to_page(d->shared_info), d, XENSHARE_writable);
diff --git a/xen/arch/arm/domctl.c b/xen/arch/arm/domctl.c
index 78fb322..7edce12 100644
--- a/xen/arch/arm/domctl.c
+++ b/xen/arch/arm/domctl.c
@@ -92,6 +92,19 @@ long arch_do_domctl(struct xen_domctl *domctl, struct domain *d,
             xfree(c.data);
     }
     break;
+    case XEN_DOMCTL_shadow_op:
+    {
+        domain_pause(d);
+        ret = dirty_mode_op(d, &domctl->u.shadow_op);
+        domain_unpause(d);
+
+        if ( (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_CLEAN ||
+             (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_PEEK )
+        {
+            copyback = 1;
+        }
+    }
+    break;
 
     default:
         return -EINVAL;
diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index 98edbc9..5b9d26d 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -838,7 +838,6 @@ void destroy_xen_mappings(unsigned long v, unsigned long e)
     create_xen_entries(REMOVE, v, 0, (e - v) >> PAGE_SHIFT, 0);
 }
 
-enum mg { mg_clear, mg_ro, mg_rw, mg_rx };
 static void set_pte_flags_on_range(const char *p, unsigned long l, enum mg mg)
 {
     lpae_t pte;
@@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d)
     unmap_domain_page_global(d->arch.dirty.second_lvl);
 }
 
+/* routine for dirty-page tracing
+ *
+ * On first write, it page faults, its entry is changed to read-write,
+ * and on retry the write succeeds.
+ *
+ * for locating p2m of the faulting entry, we use virtual-linear page table.
+ */
+void handle_page_fault(struct domain *d, paddr_t addr)
+{
+    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
+    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
+    {
+        lpae_t pte = *vlp2m_pte;
+        pte.p2m.write = 1;
+        write_pte(vlp2m_pte, pte);
+        flush_tlb_local();
+
+        /* in order to remove mappings in cleanup stage */
+        add_mapped_vaddr(d, addr);
+    }
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 2d09fef..3b2b11a 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -6,6 +6,8 @@
 #include <xen/bitops.h>
 #include <asm/flushtlb.h>
 #include <asm/gic.h>
+#include <xen/guest_access.h>
+#include <xen/pfn.h>
 
 void dump_p2m_lookup(struct domain *d, paddr_t addr)
 {
@@ -408,6 +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn)
     return p >> PAGE_SHIFT;
 }
 
+#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - sizeof(int))\
+                        / sizeof(unsigned long)
+
+/* An array-based linked list for storing virtual addresses (i.e., the 3rd
+ * level table mapping) that should be destroyed after live migration */
+struct mapped_va_node
+{
+    struct page_info *next;
+    struct mapped_va_node *mvn_next;
+    int items;
+    unsigned long vaddrs[MAX_VA_PER_NODE];
+};
+
+int add_mapped_vaddr(struct domain *d, unsigned long va)
+{
+    struct page_info *page_head;
+    struct mapped_va_node *mvn_head;
+
+    spin_lock(&d->arch.dirty.lock);
+
+    page_head = d->arch.dirty.head;
+    mvn_head = d->arch.dirty.mvn_head;
+    if ( !mvn_head )
+    {
+        page_head = alloc_domheap_page(NULL, 0);
+        if ( !page_head )
+        {
+            spin_unlock(&d->arch.dirty.lock);
+            return -ENOMEM;
+        }
+
+        mvn_head = map_domain_page_global(__page_to_mfn(page_head));
+        mvn_head->items = 0;
+        mvn_head->next = NULL;
+        mvn_head->mvn_next = NULL;
+        d->arch.dirty.head = page_head;
+        d->arch.dirty.mvn_head = mvn_head;
+    }
+
+    if ( mvn_head->items == MAX_VA_PER_NODE )
+    {
+        struct page_info *page;
+
+        page = alloc_domheap_page(NULL, 0);
+        if ( !page )
+        {
+            spin_unlock(&d->arch.dirty.lock);
+            return -ENOMEM;
+        }
+
+        mvn_head = map_domain_page_global(__page_to_mfn(page));
+        mvn_head->items = 0;
+        mvn_head->next = d->arch.dirty.head;
+        mvn_head->mvn_next = d->arch.dirty.mvn_head;
+
+        d->arch.dirty.mvn_head = mvn_head;
+        d->arch.dirty.head = page;
+    }
+
+    mvn_head->vaddrs[mvn_head->items] = va;
+    mvn_head->items ++;
+
+    spin_unlock(&d->arch.dirty.lock);
+    return 0;
+}
+
+/* get the dirty bitmap from the linked list stored by add_mapped_vaddr
+ * and also clear the mapped vaddrs linked list */
+static void get_dirty_bitmap(struct domain *d, uint8_t *bitmap[])
+{
+    struct page_info *page_head;
+    struct mapped_va_node *mvn_head;
+    struct page_info *tmp_page;
+    struct mapped_va_node *tmp_mvn;
+    vaddr_t gma_start = 0;
+    vaddr_t gma_end = 0;
+
+    /* this hypercall is called from domain 0, and we don't know which guest's
+     * vlpt is mapped in xen_second, so, to be sure, we restore vlpt here */
+    restore_vlpt(d);
+
+    spin_lock(&d->arch.dirty.lock);
+
+    page_head = d->arch.dirty.head;
+    mvn_head = d->arch.dirty.mvn_head;
+    get_gma_start_end(d, &gma_start, &gma_end);
+
+    while ( page_head )
+    {
+        int i;
+
+        for ( i = 0; i < mvn_head->items; ++i )
+        {
+            unsigned int bit_offset;
+            int bitmap_index, bitmap_offset;
+            lpae_t *vlp2m_pte;
+
+            bit_offset = third_linear_offset(mvn_head->vaddrs[i] - gma_start);
+            bitmap_index = bit_offset >> (PAGE_SHIFT + 3);
+            bitmap_offset = bit_offset & ((1ul << (PAGE_SHIFT + 3)) - 1);
+
+            __set_bit(bitmap_offset, bitmap[bitmap_index]);
+
+            vlp2m_pte = get_vlpt_3lvl_pte(mvn_head->vaddrs[i]);
+            vlp2m_pte->p2m.write = 0;
+        }
+
+        tmp_page = page_head;
+        page_head = mvn_head->next;
+
+        tmp_mvn = mvn_head;
+        mvn_head = mvn_head->mvn_next;
+
+        unmap_domain_page_global(tmp_mvn);
+        free_domheap_page(tmp_page);
+    }
+
+    d->arch.dirty.head = NULL;
+    d->arch.dirty.mvn_head = NULL;
+
+    spin_unlock(&d->arch.dirty.lock);
+}
+
+
+/* Change types across all p2m entries in a domain */
+static void p2m_change_entry_type_global(struct domain *d, enum mg nt)
+{
+    struct p2m_domain *p2m = &d->arch.p2m;
+    vaddr_t ram_base;
+    int i1, i2, i3;
+    int first_index, second_index, third_index;
+    lpae_t *first = __map_domain_page(p2m->first_level);
+    lpae_t pte, *second = NULL, *third = NULL;
+
+    get_gma_start_end(d, &ram_base, NULL);
+
+    first_index = first_table_offset((uint64_t)ram_base);
+    second_index = second_table_offset((uint64_t)ram_base);
+    third_index = third_table_offset((uint64_t)ram_base);
+
+    BUG_ON( !first && "Can't map first level p2m." );
+
+    spin_lock(&p2m->lock);
+
+    for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 )
+    {
+        lpae_walk_t first_pte = first[i1].walk;
+        if ( !first_pte.valid || !first_pte.table )
+            goto out;
+
+        second = map_domain_page(first_pte.base);
+        BUG_ON( !second && "Can't map second level p2m.");
+        for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 )
+        {
+            lpae_walk_t second_pte = second[i2].walk;
+
+            /* a nasty hack for vlpt due to the difference
+             * of page table entry layout between p2m and pt */
+            second[i2].pt.ro = 0;
+
+            if ( !second_pte.valid || !second_pte.table )
+                goto out;
+
+            third = map_domain_page(second_pte.base);
+            BUG_ON( !third && "Can't map third level p2m.");
+
+            for ( i3 = third_index; i3 < LPAE_ENTRIES; ++i3 )
+            {
+                lpae_walk_t third_pte = third[i3].walk;
+                int write = 0;
+                if ( !third_pte.valid )
+                    goto out;
+
+                pte = third[i3];
+                if ( pte.p2m.write == 1 && nt == mg_ro )
+                {
+                    pte.p2m.write = 0;
+                    write = 1;
+                }
+                else if ( pte.p2m.write == 0 && nt == mg_rw )
+                {
+                    pte.p2m.write = 1;
+                    write = 1;
+                }
+                if ( write )
+                    write_pte(&third[i3], pte);
+            }
+            unmap_domain_page(third);
+
+            third = NULL;
+            third_index = 0;
+        }
+        unmap_domain_page(second);
+
+        second = NULL;
+        second_index = 0;
+        third_index = 0;
+    }
+
+out:
+    flush_tlb_all_local();
+    if ( third ) unmap_domain_page(third);
+    if ( second ) unmap_domain_page(second);
+    if ( first ) unmap_domain_page(first);
+
+    spin_unlock(&p2m->lock);
+}
+
+/* Read a domain's log-dirty bitmap and stats.
+ * If the operation is a CLEAN, clear the bitmap and stats. */
+int log_dirty_op(struct domain *d, xen_domctl_shadow_op_t *sc)
+{
+    unsigned long gmfn_start;
+    unsigned long gmfn_end;
+    unsigned long gmfns;
+    unsigned int bitmap_pages;
+    int rc = 0, clean = 0, peek = 1;
+    uint8_t *bitmap[256]; /* bitmap[256] covers 32GB ram */
+    int i;
+
+    BUG_ON( !d->arch.map_domain.nr_banks );
+
+    gmfn_start = d->arch.map_domain.bank[0].start >> PAGE_SHIFT;
+    gmfn_end = domain_get_maximum_gpfn(d);
+    gmfns = gmfn_end - gmfn_start;
+    bitmap_pages = PFN_UP((gmfns + 7) / 8);
+
+    if ( guest_handle_is_null(sc->dirty_bitmap) )
+    {
+        peek = 0;
+    }
+    else
+    {
+        /* mapping to the bitmap from guest param */
+        vaddr_t to = (vaddr_t)sc->dirty_bitmap.p;
+
+        BUG_ON( to & ~PAGE_MASK && "offset not aligned to PAGE SIZE");
+
+        for ( i = 0; i < bitmap_pages; ++i )
+        {
+            paddr_t g;
+            rc = gvirt_to_maddr(to, &g);
+            if ( rc )
+                return rc;
+            bitmap[i] = map_domain_page(g>>PAGE_SHIFT);
+            memset(bitmap[i], 0x00, PAGE_SIZE);
+            to += PAGE_SIZE;
+        }
+    }
+
+    clean = (sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN);
+    sc->stats.dirty_count = d->arch.dirty.count;
+
+    get_dirty_bitmap(d, bitmap);
+    if ( peek )
+    {
+        for ( i = 0; i < bitmap_pages; ++i )
+        {
+            unmap_domain_page(bitmap[i]);
+        }
+    }
+
+    return 0;
+}
+
+long dirty_mode_op(struct domain *d, xen_domctl_shadow_op_t *sc)
+{
+    long ret = 0;
+    switch (sc->op)
+    {
+        case XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY:
+        case XEN_DOMCTL_SHADOW_OP_OFF:
+        {
+            enum mg nt = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? mg_rw : mg_ro;
+
+            d->arch.dirty.mode = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? 0 : 1;
+            p2m_change_entry_type_global(d, nt);
+
+            if ( sc->op == XEN_DOMCTL_SHADOW_OP_OFF )
+                cleanup_vlpt(d);
+            else
+                prepare_vlpt(d);
+        }
+        break;
+
+        case XEN_DOMCTL_SHADOW_OP_CLEAN:
+        case XEN_DOMCTL_SHADOW_OP_PEEK:
+        {
+            ret = log_dirty_op(d, sc);
+        }
+        break;
+
+        default:
+            return -ENOSYS;
+    }
+    return ret;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 4c0fc32..3b78ed2 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
     const char *msg;
     int rc, level = -1;
     mmio_info_t info;
+    int page_fault = ((dabt.dfsc & FSC_MASK) ==
+                      (FSC_FLT_PERM + FSC_3D_LEVEL) && dabt.write);
 
     if ( !check_conditional_instr(regs, hsr) )
     {
@@ -1327,22 +1329,23 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
     info.gva = READ_SYSREG64(FAR_EL2);
 #endif
 
-    if (dabt.s1ptw)
+    if ( dabt.s1ptw && !page_fault )
         goto bad_data_abort;
 
     rc = gva_to_ipa(info.gva, &info.gpa);
-    if ( rc == -EFAULT )
+    if ( rc == -EFAULT && !page_fault )
         goto bad_data_abort;
 
     /* XXX: Decode the instruction if ISS is not valid */
-    if ( !dabt.valid )
+    if ( !dabt.valid && !page_fault )
         goto bad_data_abort;
 
     /*
      * Erratum 766422: Thumb store translation fault to Hypervisor may
      * not have correct HSR Rt value.
      */
-    if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write )
+    if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write
+            && !page_fault)
     {
         rc = decode_instruction(regs, &info.dabt);
         if ( rc )
@@ -1358,6 +1361,16 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
         return;
     }
 
+    /* handle permission fault on write */
+    if ( page_fault )
+    {
+        if ( current->domain->arch.dirty.mode == 0 )
+           goto bad_data_abort;
+
+        handle_page_fault(current->domain, info.gpa);
+        return;
+    }
+
 bad_data_abort:
 
     msg = decode_fsc( dabt.dfsc, &level);
diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
index 7e7743a..de349e9 100644
--- a/xen/include/asm-arm/domain.h
+++ b/xen/include/asm-arm/domain.h
@@ -114,6 +114,11 @@ struct arch_domain
 
     /* dirty-page tracing */
     struct {
+        spinlock_t lock;
+        int mode;
+        unsigned int count;
+        struct page_info *head;     /* maintain the mapped vaddrs */
+        struct mapped_va_node *mvn_head;  /* va of the head */
         int second_lvl_start;            /* for context switch */
         int second_lvl_end;
         lpae_t *second_lvl;
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index b9cbcf2..c0eade0 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -366,6 +366,10 @@ static inline lpae_t * get_vlpt_3lvl_pte(vaddr_t addr)
         (ipa_third << 3) );
 }
 
+/* dirty-page tracing */
+enum mg { mg_clear, mg_ro, mg_rw, mg_rx };
+void handle_page_fault(struct domain *d, paddr_t addr);
+
 #endif /*  __ARCH_ARM_MM__ */
 /*
  * Local variables:
diff --git a/xen/include/asm-arm/p2m.h b/xen/include/asm-arm/p2m.h
index c660820..0bac332 100644
--- a/xen/include/asm-arm/p2m.h
+++ b/xen/include/asm-arm/p2m.h
@@ -2,6 +2,7 @@
 #define _XEN_P2M_H
 
 #include <xen/mm.h>
+#include <public/domctl.h>
 
 struct domain;
 
@@ -110,6 +111,9 @@ static inline int get_page_and_type(struct page_info *page,
     return rc;
 }
 
+long dirty_mode_op(struct domain *d, xen_domctl_shadow_op_t *sc);
+int add_mapped_vaddr(struct domain *d, unsigned long va);
+
 #endif /* _XEN_P2M_H */
 
 /*
diff --git a/xen/include/asm-arm/processor.h b/xen/include/asm-arm/processor.h
index 5294421..fced6ad 100644
--- a/xen/include/asm-arm/processor.h
+++ b/xen/include/asm-arm/processor.h
@@ -399,6 +399,8 @@ union hsr {
 #define FSC_CPR        (0x3a) /* Coprocossor Abort */
 
 #define FSC_LL_MASK    (0x03<<0)
+#define FSC_MASK       (0x3f) /* Fault status mask */
+#define FSC_3D_LEVEL   (0x03) /* Third level fault*/
 
 /* Time counter hypervisor control register */
 #define CNTHCTL_PA      (1u<<0)  /* Kernel/user access to physical counter */
-- 
1.8.1.2


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-04  4:44 ` [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing Jaeyong Yoo
@ 2013-10-07 13:02   ` Julien Grall
  2013-10-07 15:51     ` Eugene Fedotov
  2013-10-16 13:44   ` Ian Campbell
  1 sibling, 1 reply; 18+ messages in thread
From: Julien Grall @ 2013-10-07 13:02 UTC (permalink / raw)
  To: Jaeyong Yoo; +Cc: xen-devel

On 10/04/2013 05:44 AM, Jaeyong Yoo wrote:
> Add hypercall (shadow op: enable/disable and clean/peek dirted page bitmap).
> It consists of two parts: dirty page detecting and saving.
> For detecting, we setup the guest p2m's leaf PTE read-only and whenever
> the guest tries to write something, permission fault happens and traps into xen.
> The permission-faulted GPA should be saved for the toolstack (when it wants to see
> which pages are dirted). For this purpose, we temporarily save the GPAs into linked
> list by using 'add_mapped_vaddr' function and when toolstack wants
> (log_dirty_op function) the GPAs are copied into bitmap and the linnked list is flushed.
> 
> Additionally, for supporting parallel migration of domUs, vlpt area should be context
> switched.
> 
> Signed-off-by: Jaeyong Yoo <jaeyong.yoo@samsung.com>
> ---

[..]

> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> index 4c0fc32..3b78ed2 100644
> --- a/xen/arch/arm/traps.c
> +++ b/xen/arch/arm/traps.c
> @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>      const char *msg;
>      int rc, level = -1;
>      mmio_info_t info;
> +    int page_fault = ((dabt.dfsc & FSC_MASK) ==
> +                      (FSC_FLT_PERM + FSC_3D_LEVEL) && dabt.write);
>  
>      if ( !check_conditional_instr(regs, hsr) )
>      {
> @@ -1327,22 +1329,23 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>      info.gva = READ_SYSREG64(FAR_EL2);
>  #endif
>  
> -    if (dabt.s1ptw)
> +    if ( dabt.s1ptw && !page_fault )

I think checking !page_fault nearly everywhere is error-prone when
this function gets modified later.

Can you do something like this?

if ( page_fault )
  // Your code to handle page fault
else
{
  // handle_mmio
}

It will avoid && !page_fault.

>          goto bad_data_abort;
>  
>      rc = gva_to_ipa(info.gva, &info.gpa);
> -    if ( rc == -EFAULT )
> +    if ( rc == -EFAULT && !page_fault )
>          goto bad_data_abort;
>  
>      /* XXX: Decode the instruction if ISS is not valid */
> -    if ( !dabt.valid )
> +    if ( !dabt.valid && !page_fault )
>          goto bad_data_abort;
>  
>      /*
>       * Erratum 766422: Thumb store translation fault to Hypervisor may
>       * not have correct HSR Rt value.
>       */
> -    if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write )
> +    if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write
> +            && !page_fault)
>      {
>          rc = decode_instruction(regs, &info.dabt);
>          if ( rc )
> @@ -1358,6 +1361,16 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>          return;
>      }
>  
> +    /* handle permission fault on write */
> +    if ( page_fault )
> +    {
> +        if ( current->domain->arch.dirty.mode == 0 )
> +           goto bad_data_abort;
> +
> +        handle_page_fault(current->domain, info.gpa);

You must call advance_pc(regs, hsr) here.

> +        return;
> +    }
> +
>  bad_data_abort:

-- 
Julien Grall


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-07 13:02   ` Julien Grall
@ 2013-10-07 15:51     ` Eugene Fedotov
  2013-10-10 14:10       ` Julien Grall
  0 siblings, 1 reply; 18+ messages in thread
From: Eugene Fedotov @ 2013-10-07 15:51 UTC (permalink / raw)
  To: xen-devel

07.10.2013 17:02, Julien Grall:
> I think checking !page_fault is nearly everywhere is error-prone when
> this function will be modified.
>
> Can you do something like this?
>
> if ( page_fault )
>    // Your code to handle page fault
> else
> {
>    // handle_mmio
> }
>
> It will avoid && !page_fault.
Yes, but I think in the page fault case we should check whether the
address belongs to an MMIO region (a page fault at an MMIO address is
possible, but we don't need to transfer that memory).
>
>> >          goto bad_data_abort;
>> >  
>> >      rc = gva_to_ipa(info.gva, &info.gpa);
>> >-    if ( rc == -EFAULT )
>> >+    if ( rc == -EFAULT && !page_fault )
>> >          goto bad_data_abort;
>> >  
>> >      /* XXX: Decode the instruction if ISS is not valid */
>> >-    if ( !dabt.valid )
>> >+    if ( !dabt.valid && !page_fault )
>> >          goto bad_data_abort;
>> >  
>> >      /*
>> >       * Erratum 766422: Thumb store translation fault to Hypervisor may
>> >       * not have correct HSR Rt value.
>> >       */
>> >-    if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write )
>> >+    if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write
>> >+            && !page_fault)
>> >      {
>> >          rc = decode_instruction(regs, &info.dabt);
>> >          if ( rc )
>> >@@ -1358,6 +1361,16 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>> >          return;
>> >      }
>> >  
>> >+    /* handle permission fault on write */
>> >+    if ( page_fault )
>> >+    {
>> >+        if ( current->domain->arch.dirty.mode == 0 )
>> >+           goto bad_data_abort;
>> >+
>> >+        handle_page_fault(current->domain, info.gpa);
> You must call advance_pc(regs, hsr) here.
>
Let me explain. I think the difference between handle_page_fault and 
handle_mmio is that the memory operation that trapped here should be 
repeated after handle_page_fault (the page fault handler clears the 
read-only bit), but should be skipped after handle_mmio (to do this we 
call advance_pc to advance the pc register). So advance_pc is 
intentionally omitted.
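
In other words, a simplified sketch of the two paths in the abort
handler (illustrative only, not the actual do_trap_data_abort_guest
code):

if ( page_fault )
{
    handle_page_fault(current->domain, info.gpa);
    /* No advance_pc(): return with the PC unchanged so the guest
     * replays the faulting store, which now succeeds against the
     * read-write PTE. */
    return;
}

if ( handle_mmio(&info) )
{
    /* The access has been emulated, so skip the trapped instruction. */
    advance_pc(regs, hsr);
    return;
}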

Best regards,
Evgeny.


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
@ 2013-10-08  6:29 Jaeyong Yoo
  2013-10-08  8:46 ` Ian Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: Jaeyong Yoo @ 2013-10-08  6:29 UTC (permalink / raw)
  To: Julien Grall; +Cc: xen-devel@lists.xen.org

>------- Original Message -------
>Sender : Julien Grall<julien.grall@citrix.com>
>Date : 2013-10-07 22:02 (GMT+09:00)
>Title : Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
>
>On 10/04/2013 05:44 AM, Jaeyong Yoo wrote:
>> Add hypercall (shadow op: enable/disable and clean/peek dirted page bitmap).
>> It consists of two parts: dirty page detecting and saving.
>> For detecting, we setup the guest p2m's leaf PTE read-only and whenever
>> the guest tries to write something, permission fault happens and traps into xen.
>> The permission-faulted GPA should be saved for the toolstack (when it wants to see
>> which pages are dirted). For this purpose, we temporarily save the GPAs into linked
>> list by using 'add_mapped_vaddr' function and when toolstack wants
>> (log_dirty_op function) the GPAs are copied into bitmap and the linnked list is flushed.
>> 
>> Additionally, for supporting parallel migration of domUs, vlpt area should be context
>> switched.
>> 
>> Signed-off-by: Jaeyong Yoo 
>> ---
>
>[..]
>
>> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
>> index 4c0fc32..3b78ed2 100644
>> --- a/xen/arch/arm/traps.c
>> +++ b/xen/arch/arm/traps.c
>> @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>>      const char *msg;
>>      int rc, level = -1;
>>      mmio_info_t info;
>> +    int page_fault = ((dabt.dfsc & FSC_MASK) ==
>> +                      (FSC_FLT_PERM + FSC_3D_LEVEL) && dabt.write);
>>  
>>      if ( !check_conditional_instr(regs, hsr) )
>>      {
>> @@ -1327,22 +1329,23 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>>      info.gva = READ_SYSREG64(FAR_EL2);
>>  #endif
>>  
>> -    if (dabt.s1ptw)
>> +    if ( dabt.s1ptw && !page_fault )
>
>I think checking !page_fault is nearly everywhere is error-prone when
>this function will be modified.
>
>Can you do something like this?
>
>if ( page_fault )
>  // Your code to handle page fault
>else
>{
>  // handle_mmio
>}
>
>It will avoid && !page_fault.

That looks better.

>>          goto bad_data_abort;
>>  
>>      rc = gva_to_ipa(info.gva, &info.gpa);
>> -    if ( rc == -EFAULT )
>> +    if ( rc == -EFAULT && !page_fault )
>>          goto bad_data_abort;
>>  
>>      /* XXX: Decode the instruction if ISS is not valid */
>> -    if ( !dabt.valid )
>> +    if ( !dabt.valid && !page_fault )
>>          goto bad_data_abort;
>>  
>>      /*
>>       * Erratum 766422: Thumb store translation fault to Hypervisor may
>>       * not have correct HSR Rt value.
>>       */
>> -    if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write )
>> +    if ( cpu_has_erratum_766422() && (regs->cpsr & PSR_THUMB) && dabt.write
>> +            && !page_fault)
>>      {
>>          rc = decode_instruction(regs, &info.dabt);
>>          if ( rc )
>> @@ -1358,6 +1361,16 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>>          return;
>>      }
>>  
>> +    /* handle permission fault on write */
>> +    if ( page_fault )
>> +    {
>> +        if ( current->domain->arch.dirty.mode == 0 )
>> +           goto bad_data_abort;
>> +
>> +        handle_page_fault(current->domain, info.gpa);
>
>You must call advance_pc(regs, hsr) here.

I got it.

>
>> +        return;
>> +    }
>> +
>>  bad_data_abort:
>
>-- 
>Julien Grall


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-08  6:29 [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing Jaeyong Yoo
@ 2013-10-08  8:46 ` Ian Campbell
  2013-10-08  9:46   ` Jaeyong Yoo
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Campbell @ 2013-10-08  8:46 UTC (permalink / raw)
  To: jaeyong.yoo; +Cc: Julien Grall, xen-devel@lists.xen.org

On Tue, 2013-10-08 at 06:29 +0000, Jaeyong Yoo wrote:

> >> +    /* handle permission fault on write */
> >> +    if ( page_fault )
> >> +    {
> >> +        if ( current->domain->arch.dirty.mode == 0 )
> >> +           goto bad_data_abort;
> >> +
> >> +        handle_page_fault(current->domain, info.gpa);
> >
> >You must call advance_pc(regs, hsr) here.
> 
> I got it.

Are you sure about this? Eugene's reasoning that we need to replay the
faulting instruction seemed pretty sound.

Ian.


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-08  8:46 ` Ian Campbell
@ 2013-10-08  9:46   ` Jaeyong Yoo
  2013-10-08 15:36     ` Eugene Fedotov
  0 siblings, 1 reply; 18+ messages in thread
From: Jaeyong Yoo @ 2013-10-08  9:46 UTC (permalink / raw)
  To: 'Ian Campbell'; +Cc: 'Julien Grall', xen-devel

> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> bounces@lists.xen.org] On Behalf Of Ian Campbell
> Sent: Tuesday, October 08, 2013 5:46 PM
> To: jaeyong.yoo@samsung.com
> Cc: Julien Grall; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for
> dirty page tracing
> 
> On Tue, 2013-10-08 at 06:29 +0000, Jaeyong Yoo wrote:
> 
> > >> +    /* handle permission fault on write */
> > >> +    if ( page_fault )
> > >> +    {
> > >> +        if ( current->domain->arch.dirty.mode == 0 )
> > >> +           goto bad_data_abort;
> > >> +
> > >> +        handle_page_fault(current->domain, info.gpa);
> > >
> > >You must call advance_pc(regs, hsr) here.
> >
> > I got it.
> 
> Are you sure about this? Eugene's reasoning that we need to replay the
> faulting instruction seemed pretty sound.

Oh, I got confused. I think Eugene's reasoning is correct:
handle_mmio is for emulation, while handle_page_fault handles a generic
fault, which should be resolved and the instruction repeated.
Sorry for the confusion.

Jaeyong

> 
> Ian.
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-08  9:46   ` Jaeyong Yoo
@ 2013-10-08 15:36     ` Eugene Fedotov
  2013-10-10 14:13       ` Julien Grall
  0 siblings, 1 reply; 18+ messages in thread
From: Eugene Fedotov @ 2013-10-08 15:36 UTC (permalink / raw)
  To: xen-devel

If we move the necessary checks into the handle_page_fault routine, the 
resulting patch for traps.c looks simpler:

--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct 
cpu_user_regs *regs,
      const char *msg;
      int rc, level = -1;
      mmio_info_t info;
+    int page_fault = ( (dabt.dfsc & FSC_MASK) ==
+                          (FSC_FLT_PERM | FSC_3D_LEVEL) && dabt.write );

      if ( !check_conditional_instr(regs, hsr) )
      {
@@ -1334,6 +1336,13 @@ static void do_trap_data_abort_guest(struct 
cpu_user_regs *regs,
      if ( rc == -EFAULT )
          goto bad_data_abort;

+    /* domU page fault handling for guest live migration */
+    /* dabt.valid can be 0 here */
+    if ( page_fault && handle_page_fault(current->domain, info.gpa) )
+    {
+        /* Do not modify pc after page fault to repeat memory operation */
+        return;
+    }
      /* XXX: Decode the instruction if ISS is not valid */
      if ( !dabt.valid )
          goto bad_data_abort;

Would this be acceptable, or do you think the "else" statement looks better?

Here handle_page_fault returns zero if dirty-page tracing is not 
enabled for the domain (dirty.mode==0) or the address is not valid for 
the domain (we check it against the memory map), so MMIO accesses and 
faults from dom0 are not handled by handle_page_fault.

The handle_page_fault routine (see  [PATCH v4 7/9] xen/arm: Implement 
virtual-linear page table for guest p2m mapping in live migration) will be:

  int handle_page_fault(struct domain *d, paddr_t addr)
{

     lpae_t *vlp2m_pte = 0;
     vaddr_t start, end;

     if (!d->arch.dirty.mode) return 0;
     /* Ensure that addr is inside guest's RAM */
     get_gma_start_end(d, &start, &end);
     if ( addr < start || addr > end ) return 0;

     vlp2m_pte = get_vlpt_3lvl_pte(addr);
     if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
     {
         lpae_t pte = *vlp2m_pte;
         pte.p2m.write = 1;
         write_pte(vlp2m_pte, pte);
         flush_tlb_local();

         /* in order to remove mappings in cleanup stage */
         add_mapped_vaddr(d, addr);
     }

     return 1;
}

Best,
Evgeny.


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-07 15:51     ` Eugene Fedotov
@ 2013-10-10 14:10       ` Julien Grall
  0 siblings, 0 replies; 18+ messages in thread
From: Julien Grall @ 2013-10-10 14:10 UTC (permalink / raw)
  To: Eugene Fedotov; +Cc: xen-devel

On 10/07/2013 04:51 PM, Eugene Fedotov wrote:

> Let me explain. I think the difference between handle_page_fault and
> handle_mmio is that the ongoing memory operation (that trapped here)
> should be repeated after handle_page_fault (the page fault handler
> clears read-only bit), but should be stepped out after handle_mmio (to
> do this we call advance_pc to increase the pc register). So, advance_pc
> is intentionally missed.

Oh right, thanks for the explanation!

-- 
Julien Grall


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-08 15:36     ` Eugene Fedotov
@ 2013-10-10 14:13       ` Julien Grall
  2013-10-16 13:44         ` Ian Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: Julien Grall @ 2013-10-10 14:13 UTC (permalink / raw)
  To: Eugene Fedotov; +Cc: xen-devel

On 10/08/2013 04:36 PM, Eugene Fedotov wrote:
> If we move necessary checking into handle_page_fault routine, so the
> resulting patch for traps.c will look more simple:
> 
> --- a/xen/arch/arm/traps.c
> +++ b/xen/arch/arm/traps.c
> @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct
> cpu_user_regs *regs,
>      const char *msg;
>      int rc, level = -1;
>      mmio_info_t info;
> +    int page_fault = ( (dabt.dfsc & FSC_MASK) ==
> +                          (FSC_FLT_PERM | FSC_3D_LEVEL) && dabt.write );
> 
>      if ( !check_conditional_instr(regs, hsr) )
>      {
> @@ -1334,6 +1336,13 @@ static void do_trap_data_abort_guest(struct
> cpu_user_regs *regs,
>      if ( rc == -EFAULT )
>          goto bad_data_abort;
> 
> +    /* domU page fault handling for guest live migration */
> +    /* dabt.valid can be 0 here */
> +    if ( page_fault && handle_page_fault(current->domain, info.gpa) )
> +    {
> +        /* Do not modify pc after page fault to repeat memory operation */
> +        return;
> +    }
>      /* XXX: Decode the instruction if ISS is not valid */
>      if ( !dabt.valid )
>          goto bad_data_abort;
> 
> Will it be acceptable, or you think "else" statement looks more better?

I'm fine with this solution.

Cheers,

-- 
Julien Grall


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-10 14:13       ` Julien Grall
@ 2013-10-16 13:44         ` Ian Campbell
  0 siblings, 0 replies; 18+ messages in thread
From: Ian Campbell @ 2013-10-16 13:44 UTC (permalink / raw)
  To: Julien Grall; +Cc: Eugene Fedotov, xen-devel

On Thu, 2013-10-10 at 15:13 +0100, Julien Grall wrote:
> On 10/08/2013 04:36 PM, Eugene Fedotov wrote:
> > If we move necessary checking into handle_page_fault routine, so the
> > resulting patch for traps.c will look more simple:
> > 
> > --- a/xen/arch/arm/traps.c
> > +++ b/xen/arch/arm/traps.c
> > @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct
> > cpu_user_regs *regs,
> >      const char *msg;
> >      int rc, level = -1;
> >      mmio_info_t info;
> > +    int page_fault = ( (dabt.dfsc & FSC_MASK) ==
> > +                          (FSC_FLT_PERM | FSC_3D_LEVEL) && dabt.write );
> > 
> >      if ( !check_conditional_instr(regs, hsr) )
> >      {
> > @@ -1334,6 +1336,13 @@ static void do_trap_data_abort_guest(struct
> > cpu_user_regs *regs,
> >      if ( rc == -EFAULT )
> >          goto bad_data_abort;
> > 
> > +    /* domU page fault handling for guest live migration */
> > +    /* dabt.valid can be 0 here */
> > +    if ( page_fault && handle_page_fault(current->domain, info.gpa) )
> > +    {
> > +        /* Do not modify pc after page fault to repeat memory operation */
> > +        return;
> > +    }
> >      /* XXX: Decode the instruction if ISS is not valid */
> >      if ( !dabt.valid )
> >          goto bad_data_abort;
> > 
> > Will it be acceptable, or you think "else" statement looks more better?
> 
> I'm fine with this solution.

Looks good to me too, assuming you are happy with the ordering of MMIO
handling vs page faults.

> 
> Cheers,
> 


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-04  4:44 ` [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing Jaeyong Yoo
  2013-10-07 13:02   ` Julien Grall
@ 2013-10-16 13:44   ` Ian Campbell
  2013-10-17 10:12     ` Jaeyong Yoo
  1 sibling, 1 reply; 18+ messages in thread
From: Ian Campbell @ 2013-10-16 13:44 UTC (permalink / raw)
  To: Jaeyong Yoo; +Cc: Stefano Stabellini, xen-devel

On Fri, 2013-10-04 at 13:44 +0900, Jaeyong Yoo wrote:
> Add hypercall (shadow op: enable/disable and clean/peek dirted page bitmap).

"dirtied"

> It consists of two parts: dirty page detecting and saving.
> For detecting, we setup the guest p2m's leaf PTE read-only and whenever
> the guest tries to write something, permission fault happens and traps into xen.
> The permission-faulted GPA should be saved for the toolstack (when it wants to see
> which pages are dirted). For this purpose, we temporarily save the GPAs into linked

"dirtied" again

> list by using 'add_mapped_vaddr' function and when toolstack wants
> (log_dirty_op function) the GPAs are copied into bitmap and the linnked list is flushed.

"linked"

>      spin_lock_init(&d->arch.map_lock);
>      d->arch.map_domain.nr_banks = 0;
>  
> +    /* init for dirty-page tracing */
> +    d->arch.dirty.count = 0;
> +    d->arch.dirty.mode = 0;
> +    d->arch.dirty.head = NULL;
> +    d->arch.dirty.mvn_head = NULL;
> +    spin_lock_init(&d->arch.dirty.lock);
> +
> +    d->arch.dirty.second_lvl_start = 0;
> +    d->arch.dirty.second_lvl_end = 0;
> +    d->arch.dirty.second_lvl = NULL;

I think some/all of these belong in the previous patch which added the
fields.

> +
>      clear_page(d->shared_info);
>      share_xen_page_with_guest(
>          virt_to_page(d->shared_info), d, XENSHARE_writable);
> diff --git a/xen/arch/arm/domctl.c b/xen/arch/arm/domctl.c
> index 78fb322..7edce12 100644
> --- a/xen/arch/arm/domctl.c
> +++ b/xen/arch/arm/domctl.c
> @@ -92,6 +92,19 @@ long arch_do_domctl(struct xen_domctl *domctl, struct domain *d,
>              xfree(c.data);
>      }
>      break;
> +    case XEN_DOMCTL_shadow_op:
> +    {
> +        domain_pause(d);
> +        ret = dirty_mode_op(d, &domctl->u.shadow_op);
> +        domain_unpause(d);
> +
> +        if ( (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_CLEAN ||
> +             (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_PEEK )

(&domctl->u.shadow_op)->op is just a weird way to write
domctl->u.shadow_op.op isn't it?

Do you only want to copyback on success or is it always correct?

> +        {
> +            copyback = 1;
> +        }
> +    }
> +    break;


> @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d)
>      unmap_domain_page_global(d->arch.dirty.second_lvl);
>  }
>  
> +/* routine for dirty-page tracing
> + *
> + * On first write, it page faults, its entry is changed to read-write,
> + * and on retry the write succeeds.
> + *
> + * for locating p2m of the faulting entry, we use virtual-linear page table.
> + */
> +void handle_page_fault(struct domain *d, paddr_t addr)
> +{
> +    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
> +    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )

I think you need to distinguish p2m entries which are r/o due to log
dirty from those which are r/o due to wanting them to be r/o.

Maybe we don't have the latter right now? We will eventually need p2m
types I think.

> +    {
> +        lpae_t pte = *vlp2m_pte;
> +        pte.p2m.write = 1;
> +        write_pte(vlp2m_pte, pte);
> +        flush_tlb_local();
> +
> +        /* in order to remove mappings in cleanup stage */
> +        add_mapped_vaddr(d, addr);

No locks or atomic operations here? How are races with the tools reading
the dirty bitmap addressed?  If it is by clever ordering of the checks
and pte writes then I think a comment would be in order.

> +    }
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> index 2d09fef..3b2b11a 100644
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -6,6 +6,8 @@
>  #include <xen/bitops.h>
>  #include <asm/flushtlb.h>
>  #include <asm/gic.h>
> +#include <xen/guest_access.h>
> +#include <xen/pfn.h>
>  
>  void dump_p2m_lookup(struct domain *d, paddr_t addr)
>  {
> @@ -408,6 +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn)
>      return p >> PAGE_SHIFT;
>  }
>  
> +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - sizeof(int))\
> +                        / sizeof(unsigned long)
> +
> +/* An array-based linked list for storing virtual addresses (i.e., the 3rd
> + * level table mapping) that should be destroyed after live migration */
> +struct mapped_va_node
> +{
> +    struct page_info *next;
> +    struct mapped_va_node *mvn_next;
> +    int items;
> +    unsigned long vaddrs[MAX_VA_PER_NODE];
> +};
> +
> +int add_mapped_vaddr(struct domain *d, unsigned long va)
> +{

This function seems awfully expensive (it contains allocations etc.) for a
function called on every page fault during a migration.

Why are you remembering the full va of every fault when a single bit per
page would do?

I think you can allocate a bitmap when logdirty is enabled. At a bit per
page it shouldn't be too huge.

You might be able to encode this in the p2m which you walk when the
tools ask for the bit map (i.e. p2m.type==writeable p2m.write==0 =>
dirty), but I think a bitmap would do the trick and be easier to
implement.
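
Roughly something like this (an untested sketch: the dirty.bitmap,
bitmap_gmfn_start and bitmap_bits fields are made up for illustration,
they do not exist in this series):

/* At XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY time: one bit per guest page. */
static int logdirty_bitmap_alloc(struct domain *d,
                                 unsigned long gmfn_start,
                                 unsigned long gmfn_end)
{
    unsigned long nr_bits = gmfn_end - gmfn_start;

    d->arch.dirty.bitmap = xzalloc_array(unsigned long,
                                         BITS_TO_LONGS(nr_bits));
    if ( !d->arch.dirty.bitmap )
        return -ENOMEM;

    d->arch.dirty.bitmap_gmfn_start = gmfn_start;
    d->arch.dirty.bitmap_bits = nr_bits;
    return 0;
}

/* In the permission fault path, instead of add_mapped_vaddr(): no
 * allocation, no list, just an atomic bit set. */
static void logdirty_mark(struct domain *d, paddr_t gpa)
{
    unsigned long gmfn = gpa >> PAGE_SHIFT;

    if ( gmfn >= d->arch.dirty.bitmap_gmfn_start )
        set_bit(gmfn - d->arch.dirty.bitmap_gmfn_start,
                d->arch.dirty.bitmap);
}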

> +    struct page_info *page_head;
> +    struct mapped_va_node *mvn_head;
> +
> +    spin_lock(&d->arch.dirty.lock);
> +
> +    page_head = d->arch.dirty.head;
> +    mvn_head = d->arch.dirty.mvn_head;
> +    if ( !mvn_head )
> +    {
> +        page_head = alloc_domheap_page(NULL, 0);
> +        if ( !page_head )
> +        {
> +            spin_unlock(&d->arch.dirty.lock);
> +            return -ENOMEM;
> +        }
> +
> +        mvn_head = map_domain_page_global(__page_to_mfn(page_head));
> +        mvn_head->items = 0;
> +        mvn_head->next = NULL;
> +        mvn_head->mvn_next = NULL;
> +        d->arch.dirty.head = page_head;
> +        d->arch.dirty.mvn_head = mvn_head;
> +    }
> +
> +    if ( mvn_head->items == MAX_VA_PER_NODE )
> +    {
> +        struct page_info *page;
> +
> +        page = alloc_domheap_page(NULL, 0);
> +        if ( !page )
> +        {
> +            spin_unlock(&d->arch.dirty.lock);
> +            return -ENOMEM;
> +        }
> +
> +        mvn_head = map_domain_page_global(__page_to_mfn(page));
> +        mvn_head->items = 0;
> +        mvn_head->next = d->arch.dirty.head;
> +        mvn_head->mvn_next = d->arch.dirty.mvn_head;
> +
> +        d->arch.dirty.mvn_head = mvn_head;
> +        d->arch.dirty.head = page;
> +    }
> +
> +    mvn_head->vaddrs[mvn_head->items] = va;
> +    mvn_head->items ++;
> +
> +    spin_unlock(&d->arch.dirty.lock);
> +    return 0;
> +}
> +
> +/* get the dirty bitmap from the linked list stored by add_mapped_vaddr
> + * and also clear the mapped vaddrs linked list */
> +static void get_dirty_bitmap(struct domain *d, uint8_t *bitmap[])
> +{
> +    struct page_info *page_head;
> +    struct mapped_va_node *mvn_head;
> +    struct page_info *tmp_page;
> +    struct mapped_va_node *tmp_mvn;
> +    vaddr_t gma_start = 0;
> +    vaddr_t gma_end = 0;
> +
> +    /* this hypercall is called from domain 0, and we don't know which guest's
> +     * vlpt is mapped in xen_second, so, to be sure, we restore vlpt here */
> +    restore_vlpt(d);

AFAICT you don't actually use the vlpt here, just the domain's list of
dirty addresses (which, as above, could become a bitmap).

> +
> +    spin_lock(&d->arch.dirty.lock);
> +
> +    page_head = d->arch.dirty.head;
> +    mvn_head = d->arch.dirty.mvn_head;
> +    get_gma_start_end(d, &gma_start, &gma_end);
> +
> +    while ( page_head )
> +    {
> +        int i;
> +
> +        for ( i = 0; i < mvn_head->items; ++i )
> +        {
> +            unsigned int bit_offset;
> +            int bitmap_index, bitmap_offset;
> +            lpae_t *vlp2m_pte;
> +
> +            bit_offset = third_linear_offset(mvn_head->vaddrs[i] - gma_start);
> +            bitmap_index = bit_offset >> (PAGE_SHIFT + 3);
> +            bitmap_offset = bit_offset & ((1ul << (PAGE_SHIFT + 3)) - 1);
> +
> +            __set_bit(bitmap_offset, bitmap[bitmap_index]);
> +
> +            vlp2m_pte = get_vlpt_3lvl_pte(mvn_head->vaddrs[i]);
> +            vlp2m_pte->p2m.write = 0;
> +        }
> +
> +        tmp_page = page_head;
> +        page_head = mvn_head->next;
> +
> +        tmp_mvn = mvn_head;
> +        mvn_head = mvn_head->mvn_next;
> +
> +        unmap_domain_page_global(tmp_mvn);
> +        free_domheap_page(tmp_page);
> +    }
> +
> +    d->arch.dirty.head = NULL;
> +    d->arch.dirty.mvn_head = NULL;
> +
> +    spin_unlock(&d->arch.dirty.lock);
> +}
> +
> +
> +/* Change types across all p2m entries in a domain */
> +static void p2m_change_entry_type_global(struct domain *d, enum mg nt)
> +{

Stefano had some patches to generalise create_p2m_entries() into a p2m
walker infrastructure in his series to add XENMEM_exchange_and_pin. That
series turned out to be unnecessary but it could be resurrected for use
here rather than coding up a new walker.

> +    struct p2m_domain *p2m = &d->arch.p2m;
> +    vaddr_t ram_base;
> +    int i1, i2, i3;
> +    int first_index, second_index, third_index;
> +    lpae_t *first = __map_domain_page(p2m->first_level);
> +    lpae_t pte, *second = NULL, *third = NULL;
> +
> +    get_gma_start_end(d, &ram_base, NULL);
> +
> +    first_index = first_table_offset((uint64_t)ram_base);
> +    second_index = second_table_offset((uint64_t)ram_base);
> +    third_index = third_table_offset((uint64_t)ram_base);
> +
> +    BUG_ON( !first && "Can't map first level p2m." );
> +
> +    spin_lock(&p2m->lock);
> +
> +    for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 )
> +    {
> +        lpae_walk_t first_pte = first[i1].walk;
> +        if ( !first_pte.valid || !first_pte.table )
> +            goto out;
> +
> +        second = map_domain_page(first_pte.base);
> +        BUG_ON( !second && "Can't map second level p2m.");
> +        for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 )
> +        {
> +            lpae_walk_t second_pte = second[i2].walk;
> +
> +            /* a nasty hack for vlpt due to the difference
> +             * of page table entry layout between p2m and pt */
> +            second[i2].pt.ro = 0;

What is the hack here?

The issue is that the p2m second level, which is a table entry in the
p2m, is installed into the Xen page tables, where it ends up at the
third level and is treated as a block entry.

That's OK, because for both PT and P2M bits 2..10 are ignored for table
entries. So I think rather than this hack here we should simply ensure
that our non-leaf p2m's have the correct bits in them to be used as p2m
table entries.

> +
> +            if ( !second_pte.valid || !second_pte.table )
> +                goto out;
> +
> +            third = map_domain_page(second_pte.base);
> +            BUG_ON( !third && "Can't map third level p2m.");
> +
> +            for ( i3 = third_index; i3 < LPAE_ENTRIES; ++i3 )
> +            {
> +                lpae_walk_t third_pte = third[i3].walk;
> +                int write = 0;
> +                if ( !third_pte.valid )
> +                    goto out;
> +
> +                pte = third[i3];
> +                if ( pte.p2m.write == 1 && nt == mg_ro )
> +                {
> +                    pte.p2m.write = 0;
> +                    write = 1;
> +                }
> +                else if ( pte.p2m.write == 0 && nt == mg_rw )
> +                {
> +                    pte.p2m.write = 1;
> +                    write = 1;
> +                }
> +                if ( write )
> +                    write_pte(&third[i3], pte);
> +            }
> +            unmap_domain_page(third);
> +
> +            third = NULL;
> +            third_index = 0;
> +        }
> +        unmap_domain_page(second);
> +
> +        second = NULL;
> +        second_index = 0;
> +        third_index = 0;
> +    }
> +
> +out:
> +    flush_tlb_all_local();
> +    if ( third ) unmap_domain_page(third);
> +    if ( second ) unmap_domain_page(second);
> +    if ( first ) unmap_domain_page(first);
> +
> +    spin_unlock(&p2m->lock);
> +}
> +
> +/* Read a domain's log-dirty bitmap and stats.
> + * If the operation is a CLEAN, clear the bitmap and stats. */
> +int log_dirty_op(struct domain *d, xen_domctl_shadow_op_t *sc)

Can this not be static?

> +{
> +    unsigned long gmfn_start;
> +    unsigned long gmfn_end;
> +    unsigned long gmfns;
> +    unsigned int bitmap_pages;
> +    int rc = 0, clean = 0, peek = 1;
> +    uint8_t *bitmap[256]; /* bitmap[256] covers 32GB ram */
> +    int i;
> +
> +    BUG_ON( !d->arch.map_domain.nr_banks );
> +
> +    gmfn_start = d->arch.map_domain.bank[0].start >> PAGE_SHIFT;
> +    gmfn_end = domain_get_maximum_gpfn(d);
> +    gmfns = gmfn_end - gmfn_start;
> +    bitmap_pages = PFN_UP((gmfns + 7) / 8);
> +
> +    if ( guest_handle_is_null(sc->dirty_bitmap) )
> +    {
> +        peek = 0;
> +    }
> +    else
> +    {
> +        /* mapping to the bitmap from guest param */
> +        vaddr_t to = (vaddr_t)sc->dirty_bitmap.p;

This is not allowed, please use the guest handle interfaces and
copy_(to/from)_(user/guest) instead of diving into the guest handle
struct and open coding it yourself.

This might mean creating a temporary bitmap in hypervisor space, but if
you maintain the bitmap across the whole dirty period as suggested then
you should be able to simply copy it out.

Actually, for consistency you might need an atomic copy-and-clear
mechanism (otherwise you can perhaps lose updates), in which case
you'll need a temporary buffer.
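
For example (sketch only, reusing the hypothetical bitmap fields from
the earlier sketch; error handling is trimmed):

static int logdirty_copy_and_clean(struct domain *d,
                                   xen_domctl_shadow_op_t *sc, int clean)
{
    unsigned long nr_bytes =
        BITS_TO_LONGS(d->arch.dirty.bitmap_bits) * sizeof(unsigned long);
    uint8_t *snapshot = xmalloc_bytes(nr_bytes);
    int rc = 0;

    if ( !snapshot )
        return -ENOMEM;

    /* Snapshot and (for CLEAN) reset the live bitmap in one critical
     * section, so no marks from the fault handler are lost in between. */
    spin_lock(&d->arch.dirty.lock);
    memcpy(snapshot, d->arch.dirty.bitmap, nr_bytes);
    if ( clean )
        memset(d->arch.dirty.bitmap, 0, nr_bytes);
    spin_unlock(&d->arch.dirty.lock);

    if ( copy_to_guest(sc->dirty_bitmap, snapshot, nr_bytes) )
        rc = -EFAULT;

    xfree(snapshot);
    return rc;
}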

> +
> +        BUG_ON( to & ~PAGE_MASK && "offset not aligned to PAGE SIZE");
> +
> +        for ( i = 0; i < bitmap_pages; ++i )
> +        {
> +            paddr_t g;
> +            rc = gvirt_to_maddr(to, &g);
> +            if ( rc )
> +                return rc;
> +            bitmap[i] = map_domain_page(g>>PAGE_SHIFT);
> +            memset(bitmap[i], 0x00, PAGE_SIZE);
> +            to += PAGE_SIZE;
> +        }
> +    }
> +
> +    clean = (sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN);
> +    sc->stats.dirty_count = d->arch.dirty.count;
> +
> +    get_dirty_bitmap(d, bitmap);
> +    if ( peek )
> +    {
> +        for ( i = 0; i < bitmap_pages; ++i )
> +        {
> +            unmap_domain_page(bitmap[i]);
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +long dirty_mode_op(struct domain *d, xen_domctl_shadow_op_t *sc)
> +{
> +    long ret = 0;
> +    switch (sc->op)
> +    {
> +        case XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY:
> +        case XEN_DOMCTL_SHADOW_OP_OFF:
> +        {
> +            enum mg nt = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? mg_rw : mg_ro;
> +
> +            d->arch.dirty.mode = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? 0 : 1;
> +            p2m_change_entry_type_global(d, nt);
> +
> +            if ( sc->op == XEN_DOMCTL_SHADOW_OP_OFF )
> +                cleanup_vlpt(d);
> +            else
> +                prepare_vlpt(d);
> +        }
> +        break;
> +
> +        case XEN_DOMCTL_SHADOW_OP_CLEAN:
> +        case XEN_DOMCTL_SHADOW_OP_PEEK:
> +        {
> +            ret = log_dirty_op(d, sc);
> +        }
> +        break;
> +
> +        default:
> +            return -ENOSYS;
> +    }
> +    return ret;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
> index 4c0fc32..3b78ed2 100644
> --- a/xen/arch/arm/traps.c
> +++ b/xen/arch/arm/traps.c
> @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>      const char *msg;
>      int rc, level = -1;
>      mmio_info_t info;
> +    int page_fault = ((dabt.dfsc & FSC_MASK) ==
> +                      (FSC_FLT_PERM + FSC_3D_LEVEL) && dabt.write);
>  
>      if ( !check_conditional_instr(regs, hsr) )
>      {
> @@ -1327,22 +1329,23 @@ static void do_trap_data_abort_guest(struct cpu_user_regs *regs,
>      info.gva = READ_SYSREG64(FAR_EL2);
>  #endif
>  
> -    if (dabt.s1ptw)
> +    if ( dabt.s1ptw && !page_fault )
>          goto bad_data_abort;

I see this has been commented on elsewhere.

Ian.


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-16 13:44   ` Ian Campbell
@ 2013-10-17 10:12     ` Jaeyong Yoo
  2013-10-17 10:30       ` Ian Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: Jaeyong Yoo @ 2013-10-17 10:12 UTC (permalink / raw)
  To: 'Ian Campbell'; +Cc: 'Stefano Stabellini', xen-devel

> -----Original Message-----
> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Sent: Wednesday, October 16, 2013 10:45 PM
> To: Jaeyong Yoo
> Cc: xen-devel@lists.xen.org; Stefano Stabellini
> Subject: Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for
> dirty page tracing
> 
> On Fri, 2013-10-04 at 13:44 +0900, Jaeyong Yoo wrote:
> > Add hypercall (shadow op: enable/disable and clean/peek dirted page
> bitmap).
> 
> "dirtied"
> 
> > It consists of two parts: dirty page detecting and saving.
> > For detecting, we setup the guest p2m's leaf PTE read-only and
> > whenever the guest tries to write something, permission fault happens
> and traps into xen.
> > The permission-faulted GPA should be saved for the toolstack (when it
> > wants to see which pages are dirted). For this purpose, we temporarily
> > save the GPAs into linked
> 
> "dirtied" again
> 
> > list by using 'add_mapped_vaddr' function and when toolstack wants
> > (log_dirty_op function) the GPAs are copied into bitmap and the linnked
> list is flushed.
> 
> "linked"

Oops, typos.

> 
> >      spin_lock_init(&d->arch.map_lock);
> >      d->arch.map_domain.nr_banks = 0;
> >
> > +    /* init for dirty-page tracing */
> > +    d->arch.dirty.count = 0;
> > +    d->arch.dirty.mode = 0;
> > +    d->arch.dirty.head = NULL;
> > +    d->arch.dirty.mvn_head = NULL;
> > +    spin_lock_init(&d->arch.dirty.lock);
> > +
> > +    d->arch.dirty.second_lvl_start = 0;
> > +    d->arch.dirty.second_lvl_end = 0;
> > +    d->arch.dirty.second_lvl = NULL;
> 
> I think some/all of these belong in the previous patch which added the
> fields.

Oops, right.

> 
> > +
> >      clear_page(d->shared_info);
> >      share_xen_page_with_guest(
> >          virt_to_page(d->shared_info), d, XENSHARE_writable); diff
> > --git a/xen/arch/arm/domctl.c b/xen/arch/arm/domctl.c index
> > 78fb322..7edce12 100644
> > --- a/xen/arch/arm/domctl.c
> > +++ b/xen/arch/arm/domctl.c
> > @@ -92,6 +92,19 @@ long arch_do_domctl(struct xen_domctl *domctl, struct
> domain *d,
> >              xfree(c.data);
> >      }
> >      break;
> > +    case XEN_DOMCTL_shadow_op:
> > +    {
> > +        domain_pause(d);
> > +        ret = dirty_mode_op(d, &domctl->u.shadow_op);
> > +        domain_unpause(d);
> > +
> > +        if ( (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_CLEAN ||
> > +             (&domctl->u.shadow_op)->op == XEN_DOMCTL_SHADOW_OP_PEEK
> > + )
> 
> (&domctl->u.shadow_op)->op is just a weird way to write
> domctl->u.shadow_op.op isn't it?

Right!

> 
> Do you only want to copyback on success or is it always correct?

Oh, I think I missed the else condition here.

> 
> > +        {
> > +            copyback = 1;
> > +        }
> > +    }
> > +    break;
> 
> 
> > @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d)
> >      unmap_domain_page_global(d->arch.dirty.second_lvl);
> >  }
> >
> > +/* routine for dirty-page tracing
> > + *
> > + * On first write, it page faults, its entry is changed to
> > +read-write,
> > + * and on retry the write succeeds.
> > + *
> > + * for locating p2m of the faulting entry, we use virtual-linear page
> table.
> > + */
> > +void handle_page_fault(struct domain *d, paddr_t addr) {
> > +    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
> > +    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
> 
> I think you need to distinguish p2m entries which are r/o due to log dirty
> from those which are r/o due to wanting them to be r/o.
> 
> Maybe we don't have the latter right now? We will eventually need p2m
> types I think.

Yes, that should be distinguished. Actually, that was in my mind and
apparently I failed to remember. For doing this, I'm thinking of using an
unused field in the PTE as an indicator of 'wanting to be r/o' and checking
that indicator on the permission fault.

> 
> > +    {
> > +        lpae_t pte = *vlp2m_pte;
> > +        pte.p2m.write = 1;
> > +        write_pte(vlp2m_pte, pte);
> > +        flush_tlb_local();
> > +
> > +        /* in order to remove mappings in cleanup stage */
> > +        add_mapped_vaddr(d, addr);
> 
> No locks or atomic operations here? How are races with the tools reading
> the dirty bitmap addressed?  If it is by clever ordering of the checks and
> pte writes then I think a comment would be in order.

Actually, the lock is inside the add_mapped_vaddr function. 

> 
> > +    }
> > +}
> > +
> >  /*
> >   * Local variables:
> >   * mode: C
> > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c index
> > 2d09fef..3b2b11a 100644
> > --- a/xen/arch/arm/p2m.c
> > +++ b/xen/arch/arm/p2m.c
> > @@ -6,6 +6,8 @@
> >  #include <xen/bitops.h>
> >  #include <asm/flushtlb.h>
> >  #include <asm/gic.h>
> > +#include <xen/guest_access.h>
> > +#include <xen/pfn.h>
> >
> >  void dump_p2m_lookup(struct domain *d, paddr_t addr)  { @@ -408,6
> > +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long
> gpfn)
> >      return p >> PAGE_SHIFT;
> >  }
> >
> > +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - sizeof(int))\
> > +                        / sizeof(unsigned long)
> > +
> > +/* An array-based linked list for storing virtual addresses (i.e.,
> > +the 3rd
> > + * level table mapping) that should be destroyed after live migration
> > +*/ struct mapped_va_node {
> > +    struct page_info *next;
> > +    struct mapped_va_node *mvn_next;
> > +    int items;
> > +    unsigned long vaddrs[MAX_VA_PER_NODE]; };
> > +
> > +int add_mapped_vaddr(struct domain *d, unsigned long va) {
> 
> This function seems awefuly expensive (contains allocations etc) for a
> function called on every page fault during a migration.
> 
> Why are you remembering the full va of every fault when a single bit per
> page would do?
> 
> I think you can allocate a bitmap when logdirty is enabled. At a bit per
> page it shouldn't be too huge.
> 
> You might be able to encode this in the p2m which you walk when the tools
> ask for the bit map (i.e. p2m.type==writeable p2m.write==0 => dirty), but
> I think a bitmap would do the trick and be easier to implement.

There are three candidates:

1) bitmap
2) encoding into p2m
3) linked list of VAs. 

And, two functions for dirty-page tracing:

A) dirty-page handling (At trap)
B) getting dirty bitmap (for toolstack)

1) and 2) have an advantage for A), but at B) they have to search the full
range of pages to find out which pages are dirtied and to set those pages
read-only again for the next round of dirty detection. On the other hand,
3) does not need to search the full range at B). I prefer 3) because of the
non-full-range search, but I understand your concern.
Maybe I can do some measurements for each method to see which one is better.
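For 1), the allocation itself would only happen once, when log-dirty is
enabled; a minimal sketch of what I have in mind (the
d->arch.dirty.bitmap/bitmap_bits fields are made-up names, not in the
current patch):

    /* Sketch: one bit per guest page, allocated at ENABLE_LOGDIRTY. */
    static int logdirty_bitmap_alloc(struct domain *d)
    {
        vaddr_t gma_start = 0, gma_end = 0;
        unsigned long nr_bits;

        get_gma_start_end(d, &gma_start, &gma_end);
        nr_bits = (gma_end - gma_start) >> PAGE_SHIFT;

        d->arch.dirty.bitmap = xzalloc_array(unsigned long,
                                             BITS_TO_LONGS(nr_bits));
        if ( !d->arch.dirty.bitmap )
            return -ENOMEM;

        d->arch.dirty.bitmap_bits = nr_bits;
        return 0;
    }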

> 
> > +    struct page_info *page_head;
> > +    struct mapped_va_node *mvn_head;
> > +
> > +    spin_lock(&d->arch.dirty.lock);
> > +
> > +    page_head = d->arch.dirty.head;
> > +    mvn_head = d->arch.dirty.mvn_head;
> > +    if ( !mvn_head )
> > +    {
> > +        page_head = alloc_domheap_page(NULL, 0);
> > +        if ( !page_head )
> > +        {
> > +            spin_unlock(&d->arch.dirty.lock);
> > +            return -ENOMEM;
> > +        }
> > +
> > +        mvn_head = map_domain_page_global(__page_to_mfn(page_head));
> > +        mvn_head->items = 0;
> > +        mvn_head->next = NULL;
> > +        mvn_head->mvn_next = NULL;
> > +        d->arch.dirty.head = page_head;
> > +        d->arch.dirty.mvn_head = mvn_head;
> > +    }
> > +
> > +    if ( mvn_head->items == MAX_VA_PER_NODE )
> > +    {
> > +        struct page_info *page;
> > +
> > +        page = alloc_domheap_page(NULL, 0);
> > +        if ( !page )
> > +        {
> > +            spin_unlock(&d->arch.dirty.lock);
> > +            return -ENOMEM;
> > +        }
> > +
> > +        mvn_head = map_domain_page_global(__page_to_mfn(page));
> > +        mvn_head->items = 0;
> > +        mvn_head->next = d->arch.dirty.head;
> > +        mvn_head->mvn_next = d->arch.dirty.mvn_head;
> > +
> > +        d->arch.dirty.mvn_head = mvn_head;
> > +        d->arch.dirty.head = page;
> > +    }
> > +
> > +    mvn_head->vaddrs[mvn_head->items] = va;
> > +    mvn_head->items ++;
> > +
> > +    spin_unlock(&d->arch.dirty.lock);
> > +    return 0;
> > +}
> > +
> > +/* get the dirty bitmap from the linked list stored by
> > +add_mapped_vaddr
> > + * and also clear the mapped vaddrs linked list */ static void
> > +get_dirty_bitmap(struct domain *d, uint8_t *bitmap[]) {
> > +    struct page_info *page_head;
> > +    struct mapped_va_node *mvn_head;
> > +    struct page_info *tmp_page;
> > +    struct mapped_va_node *tmp_mvn;
> > +    vaddr_t gma_start = 0;
> > +    vaddr_t gma_end = 0;
> > +
> > +    /* this hypercall is called from domain 0, and we don't know which
> guest's
> > +     * vlpt is mapped in xen_second, so, to be sure, we restore vlpt
> here */
> > +    restore_vlpt(d);
> 
> AFAICT you don't actually use the vlpt here, just the domains list of
> dirty addresses (which as above could become a bitmap)

I think the vlpt is still needed because at this stage we have to reset the
write permission of the dirtied pages. Maybe there are some tricks for doing
this efficiently?

> 
> > +
> > +    spin_lock(&d->arch.dirty.lock);
> > +
> > +    page_head = d->arch.dirty.head;
> > +    mvn_head = d->arch.dirty.mvn_head;
> > +    get_gma_start_end(d, &gma_start, &gma_end);
> > +
> > +    while ( page_head )
> > +    {
> > +        int i;
> > +
> > +        for ( i = 0; i < mvn_head->items; ++i )
> > +        {
> > +            unsigned int bit_offset;
> > +            int bitmap_index, bitmap_offset;
> > +            lpae_t *vlp2m_pte;
> > +
> > +            bit_offset = third_linear_offset(mvn_head->vaddrs[i] -
> gma_start);
> > +            bitmap_index = bit_offset >> (PAGE_SHIFT + 3);
> > +            bitmap_offset = bit_offset & ((1ul << (PAGE_SHIFT + 3)) -
> > + 1);
> > +
> > +            __set_bit(bitmap_offset, bitmap[bitmap_index]);
> > +
> > +            vlp2m_pte = get_vlpt_3lvl_pte(mvn_head->vaddrs[i]);
> > +            vlp2m_pte->p2m.write = 0;
> > +        }
> > +
> > +        tmp_page = page_head;
> > +        page_head = mvn_head->next;
> > +
> > +        tmp_mvn = mvn_head;
> > +        mvn_head = mvn_head->mvn_next;
> > +
> > +        unmap_domain_page_global(tmp_mvn);
> > +        free_domheap_page(tmp_page);
> > +    }
> > +
> > +    d->arch.dirty.head = NULL;
> > +    d->arch.dirty.mvn_head = NULL;
> > +
> > +    spin_unlock(&d->arch.dirty.lock); }
> > +
> > +
> > +/* Change types across all p2m entries in a domain */ static void
> > +p2m_change_entry_type_global(struct domain *d, enum mg nt) {
> 
> Stefano had some patches to generalise create_p2m_entries() into a p2m
> walker infrastructure in his series to add XENMEM_exchange_and_pin. That
> series turned out to be unnecessary but it could be resurrected for use
> here rather than recoding a new one.

p2m walker infrastructure sounds interesting. 

> 
> > +    struct p2m_domain *p2m = &d->arch.p2m;
> > +    vaddr_t ram_base;
> > +    int i1, i2, i3;
> > +    int first_index, second_index, third_index;
> > +    lpae_t *first = __map_domain_page(p2m->first_level);
> > +    lpae_t pte, *second = NULL, *third = NULL;
> > +
> > +    get_gma_start_end(d, &ram_base, NULL);
> > +
> > +    first_index = first_table_offset((uint64_t)ram_base);
> > +    second_index = second_table_offset((uint64_t)ram_base);
> > +    third_index = third_table_offset((uint64_t)ram_base);
> > +
> > +    BUG_ON( !first && "Can't map first level p2m." );
> > +
> > +    spin_lock(&p2m->lock);
> > +
> > +    for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 )
> > +    {
> > +        lpae_walk_t first_pte = first[i1].walk;
> > +        if ( !first_pte.valid || !first_pte.table )
> > +            goto out;
> > +
> > +        second = map_domain_page(first_pte.base);
> > +        BUG_ON( !second && "Can't map second level p2m.");
> > +        for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 )
> > +        {
> > +            lpae_walk_t second_pte = second[i2].walk;
> > +
> > +            /* a nasty hack for vlpt due to the difference
> > +             * of page table entry layout between p2m and pt */
> > +            second[i2].pt.ro = 0;
> 
> What is the hack here?
> 
> The issue is that the p2m second level, which is a table entry in the p2m
> is installed into the Xen page tables where it ends up the third level,
> treated as a block entry.
> 
> That's OK, because for both PT and P2M bits 2..10 are ignored for table
> entries. So I think rather than this hack here we should simply ensure
> that our non-leaf p2m's have the correct bits in them to be used as p2m
> table entries.

Oh, I see. So, should I put something like this in the create_p2m_entries
function?

> 
> > +
> > +            if ( !second_pte.valid || !second_pte.table )
> > +                goto out;
> > +
> > +            third = map_domain_page(second_pte.base);
> > +            BUG_ON( !third && "Can't map third level p2m.");
> > +
> > +            for ( i3 = third_index; i3 < LPAE_ENTRIES; ++i3 )
> > +            {
> > +                lpae_walk_t third_pte = third[i3].walk;
> > +                int write = 0;
> > +                if ( !third_pte.valid )
> > +                    goto out;
> > +
> > +                pte = third[i3];
> > +                if ( pte.p2m.write == 1 && nt == mg_ro )
> > +                {
> > +                    pte.p2m.write = 0;
> > +                    write = 1;
> > +                }
> > +                else if ( pte.p2m.write == 0 && nt == mg_rw )
> > +                {
> > +                    pte.p2m.write = 1;
> > +                    write = 1;
> > +                }
> > +                if ( write )
> > +                    write_pte(&third[i3], pte);
> > +            }
> > +            unmap_domain_page(third);
> > +
> > +            third = NULL;
> > +            third_index = 0;
> > +        }
> > +        unmap_domain_page(second);
> > +
> > +        second = NULL;
> > +        second_index = 0;
> > +        third_index = 0;
> > +    }
> > +
> > +out:
> > +    flush_tlb_all_local();
> > +    if ( third ) unmap_domain_page(third);
> > +    if ( second ) unmap_domain_page(second);
> > +    if ( first ) unmap_domain_page(first);
> > +
> > +    spin_unlock(&p2m->lock);
> > +}
> > +
> > +/* Read a domain's log-dirty bitmap and stats.
> > + * If the operation is a CLEAN, clear the bitmap and stats. */ int
> > +log_dirty_op(struct domain *d, xen_domctl_shadow_op_t *sc)
> 
> Can this not be static?

Yes, right.

> 
> > +{
> > +    unsigned long gmfn_start;
> > +    unsigned long gmfn_end;
> > +    unsigned long gmfns;
> > +    unsigned int bitmap_pages;
> > +    int rc = 0, clean = 0, peek = 1;
> > +    uint8_t *bitmap[256]; /* bitmap[256] covers 32GB ram */
> > +    int i;
> > +
> > +    BUG_ON( !d->arch.map_domain.nr_banks );
> > +
> > +    gmfn_start = d->arch.map_domain.bank[0].start >> PAGE_SHIFT;
> > +    gmfn_end = domain_get_maximum_gpfn(d);
> > +    gmfns = gmfn_end - gmfn_start;
> > +    bitmap_pages = PFN_UP((gmfns + 7) / 8);
> > +
> > +    if ( guest_handle_is_null(sc->dirty_bitmap) )
> > +    {
> > +        peek = 0;
> > +    }
> > +    else
> > +    {
> > +        /* mapping to the bitmap from guest param */
> > +        vaddr_t to = (vaddr_t)sc->dirty_bitmap.p;
> 
> This is not allowed, please use the guest handle interfaces and
> copy_(to/from)_(user/guest) instead of diving into the guest handle struct
> and open coding it yourself.
> 
> This might mean creating a temporary bitmap in hypervisor space, but if
> you maintain the bitmap across the whole dirty period as suggested then
> you should be able to simply copy it out.

Oh, I'm seeing more advantages to using a bitmap.
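With a bitmap kept in the hypervisor for the whole dirty period, the
copy-out could use the guest handle directly, roughly like this (a sketch;
d->arch.dirty.bitmap/bitmap_bits are assumed field names):

    /* Sketch: copy the accumulated dirty bitmap out via the guest
     * handle instead of mapping the toolstack buffer by hand. */
    uint32_t nr_bytes = (d->arch.dirty.bitmap_bits + 7) / 8;

    if ( !guest_handle_is_null(sc->dirty_bitmap) &&
         copy_to_guest(sc->dirty_bitmap,
                       (uint8_t *)d->arch.dirty.bitmap, nr_bytes) )
        rc = -EFAULT;

    if ( sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN )
        bitmap_zero(d->arch.dirty.bitmap, d->arch.dirty.bitmap_bits);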

> 
> Actually, for consistency you might need an atomic copy and clear
> mechanism (otherwise you can perhaps loose updates), in which case you'll
> need a temporary buffer.
> 
> > +
> > +        BUG_ON( to & ~PAGE_MASK && "offset not aligned to PAGE
> > + SIZE");
> > +
> > +        for ( i = 0; i < bitmap_pages; ++i )
> > +        {
> > +            paddr_t g;
> > +            rc = gvirt_to_maddr(to, &g);
> > +            if ( rc )
> > +                return rc;
> > +            bitmap[i] = map_domain_page(g>>PAGE_SHIFT);
> > +            memset(bitmap[i], 0x00, PAGE_SIZE);
> > +            to += PAGE_SIZE;
> > +        }
> > +    }
> > +
> > +    clean = (sc->op == XEN_DOMCTL_SHADOW_OP_CLEAN);
> > +    sc->stats.dirty_count = d->arch.dirty.count;
> > +
> > +    get_dirty_bitmap(d, bitmap);
> > +    if ( peek )
> > +    {
> > +        for ( i = 0; i < bitmap_pages; ++i )
> > +        {
> > +            unmap_domain_page(bitmap[i]);
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +long dirty_mode_op(struct domain *d, xen_domctl_shadow_op_t *sc) {
> > +    long ret = 0;
> > +    switch (sc->op)
> > +    {
> > +        case XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY:
> > +        case XEN_DOMCTL_SHADOW_OP_OFF:
> > +        {
> > +            enum mg nt = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? mg_rw :
> > +mg_ro;
> > +
> > +            d->arch.dirty.mode = sc->op == XEN_DOMCTL_SHADOW_OP_OFF ? 0 :
> 1;
> > +            p2m_change_entry_type_global(d, nt);
> > +
> > +            if ( sc->op == XEN_DOMCTL_SHADOW_OP_OFF )
> > +                cleanup_vlpt(d);
> > +            else
> > +                prepare_vlpt(d);
> > +        }
> > +        break;
> > +
> > +        case XEN_DOMCTL_SHADOW_OP_CLEAN:
> > +        case XEN_DOMCTL_SHADOW_OP_PEEK:
> > +        {
> > +            ret = log_dirty_op(d, sc);
> > +        }
> > +        break;
> > +
> > +        default:
> > +            return -ENOSYS;
> > +    }
> > +    return ret;
> > +}
> > +
> >  /*
> >   * Local variables:
> >   * mode: C
> > diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c index
> > 4c0fc32..3b78ed2 100644
> > --- a/xen/arch/arm/traps.c
> > +++ b/xen/arch/arm/traps.c
> > @@ -1313,6 +1313,8 @@ static void do_trap_data_abort_guest(struct
> cpu_user_regs *regs,
> >      const char *msg;
> >      int rc, level = -1;
> >      mmio_info_t info;
> > +    int page_fault = ((dabt.dfsc & FSC_MASK) ==
> > +                      (FSC_FLT_PERM + FSC_3D_LEVEL) && dabt.write);
> >
> >      if ( !check_conditional_instr(regs, hsr) )
> >      {
> > @@ -1327,22 +1329,23 @@ static void do_trap_data_abort_guest(struct
> cpu_user_regs *regs,
> >      info.gva = READ_SYSREG64(FAR_EL2);  #endif
> >
> > -    if (dabt.s1ptw)
> > +    if ( dabt.s1ptw && !page_fault )
> >          goto bad_data_abort;
> 
> I see this has been commented on elsewhere.
> 
> Ian.


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-17 10:12     ` Jaeyong Yoo
@ 2013-10-17 10:30       ` Ian Campbell
  2013-10-17 11:05         ` Jaeyong Yoo
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Campbell @ 2013-10-17 10:30 UTC (permalink / raw)
  To: Jaeyong Yoo; +Cc: 'Stefano Stabellini', xen-devel

On Thu, 2013-10-17 at 19:12 +0900, Jaeyong Yoo wrote:

> > > @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d)
> > >      unmap_domain_page_global(d->arch.dirty.second_lvl);
> > >  }
> > >
> > > +/* routine for dirty-page tracing
> > > + *
> > > + * On first write, it page faults, its entry is changed to
> > > +read-write,
> > > + * and on retry the write succeeds.
> > > + *
> > > + * for locating p2m of the faulting entry, we use virtual-linear page
> > table.
> > > + */
> > > +void handle_page_fault(struct domain *d, paddr_t addr) {
> > > +    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
> > > +    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
> > 
> > I think you need to distinguish p2m entries which are r/o due to log dirty
> > from those which are r/o due to wanting them to be r/o.
> > 
> > Maybe we don't have the latter right now? We will eventually need p2m
> > types I think.
> 
> Yes, that should be distinguished. Actually, that was in my mind and apparently
> I failed to remember. For doing this, I'm thinking to use un-used field in pte
> as an indicator of 'wanting to be r/o' and check this indicator in permission fault. 

Yes, I think we can use the 4 avail bits to store a p2m type. ISTR
discussing this before, but perhaps it wasn't with you!
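Roughly what I mean (a sketch only; the type names and the p2m.avail field
are illustrative, not existing code):

    /* Sketch: keep a p2m type in the 4 ignored/"avail" bits of each
     * p2m entry, and use it to tell log-dirty r/o apart from real r/o. */
    typedef enum {
        p2m_invalid = 0,
        p2m_ram_rw,        /* normal guest RAM */
        p2m_ram_logdirty,  /* write-protected only for dirty tracking */
        p2m_ram_ro,        /* genuinely read-only for the guest */
    } p2m_type_t;

    void handle_page_fault(struct domain *d, paddr_t addr)
    {
        lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);

        /* Only restore write access for pages made r/o by log-dirty;
         * a page that is meant to be r/o keeps faulting. */
        if ( vlp2m_pte->p2m.valid && !vlp2m_pte->p2m.write &&
             vlp2m_pte->p2m.avail == p2m_ram_logdirty )
        {
            lpae_t pte = *vlp2m_pte;
            pte.p2m.write = 1;
            write_pte(vlp2m_pte, pte);
            flush_tlb_local();
            add_mapped_vaddr(d, addr);
        }
    }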

> 
> > 
> > > +    {
> > > +        lpae_t pte = *vlp2m_pte;
> > > +        pte.p2m.write = 1;
> > > +        write_pte(vlp2m_pte, pte);
> > > +        flush_tlb_local();
> > > +
> > > +        /* in order to remove mappings in cleanup stage */
> > > +        add_mapped_vaddr(d, addr);
> > 
> > No locks or atomic operations here? How are races with the tools reading
> > the dirty bitmap addressed?  If it is by clever ordering of the checks and
> > pte writes then I think a comment would be in order.
> 
> Actually, the lock is inside the add_mapped_vaddr function. 

But that only covers the bitmap update, not the pte frobbing.

> 
> > 
> > > +    }
> > > +}
> > > +
> > >  /*
> > >   * Local variables:
> > >   * mode: C
> > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c index
> > > 2d09fef..3b2b11a 100644
> > > --- a/xen/arch/arm/p2m.c
> > > +++ b/xen/arch/arm/p2m.c
> > > @@ -6,6 +6,8 @@
> > >  #include <xen/bitops.h>
> > >  #include <asm/flushtlb.h>
> > >  #include <asm/gic.h>
> > > +#include <xen/guest_access.h>
> > > +#include <xen/pfn.h>
> > >
> > >  void dump_p2m_lookup(struct domain *d, paddr_t addr)  { @@ -408,6
> > > +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long
> > gpfn)
> > >      return p >> PAGE_SHIFT;
> > >  }
> > >
> > > +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - sizeof(int))\
> > > +                        / sizeof(unsigned long)
> > > +
> > > +/* An array-based linked list for storing virtual addresses (i.e.,
> > > +the 3rd
> > > + * level table mapping) that should be destroyed after live migration
> > > +*/ struct mapped_va_node {
> > > +    struct page_info *next;
> > > +    struct mapped_va_node *mvn_next;
> > > +    int items;
> > > +    unsigned long vaddrs[MAX_VA_PER_NODE]; };
> > > +
> > > +int add_mapped_vaddr(struct domain *d, unsigned long va) {
> > 
> > This function seems awefuly expensive (contains allocations etc) for a
> > function called on every page fault during a migration.
> > 
> > Why are you remembering the full va of every fault when a single bit per
> > page would do?
> > 
> > I think you can allocate a bitmap when logdirty is enabled. At a bit per
> > page it shouldn't be too huge.
> > 
> > You might be able to encode this in the p2m which you walk when the tools
> > ask for the bit map (i.e. p2m.type==writeable p2m.write==0 => dirty), but
> > I think a bitmap would do the trick and be easier to implement.
> 
> There are three candidates:
> 
> 1) bitmap
> 2) encoding into p2m
> 3) linked list of VAs. 
> 
> And, two functions for dirty-page tracing:
> 
> A) dirty-page handling (At trap)
> B) getting dirty bitmap (for toolstack)
> 
> 1) and 2) have advantage for A) but have to search the full range of pages at B)
> to find out which page is dirtied and set the page as read-only for next dirty
> detection. On the otherhand, 3) does not need to search the full range at B).
> I prefer 3) due to the 'non-full range search' but I understand your concern.
> Maybe I can do some measurement for each method to see which one better. 

If you have a bitmap then you can do a scan for each set bit, and the
offset of each bit corresponds to the guest pfn, which allows you to
directly calculate the address of the pte in the vlpt.

I don't think doing a scan for each set bit is going to compare
unfavourably to walking a linked list. Certainly not once the memory
allocator gets involved.
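i.e. roughly (a sketch, ignoring locking; dirty_bitmap, nr_bits, gma_start
and tools_bitmap are whatever the final patch ends up calling them):

    /* Sketch: each set bit is a pfn offset from the start of guest RAM,
     * which maps straight onto a vlpt entry. */
    unsigned long i = find_first_bit(dirty_bitmap, nr_bits);

    while ( i < nr_bits )
    {
        vaddr_t va = gma_start + ((vaddr_t)i << PAGE_SHIFT);
        lpae_t *pte = get_vlpt_3lvl_pte(va);
        lpae_t new = *pte;

        new.p2m.write = 0;               /* re-arm dirty detection */
        write_pte(pte, new);

        __set_bit(i, tools_bitmap);      /* report to the toolstack */

        i = find_next_bit(dirty_bitmap, nr_bits, i + 1);
    }
    flush_tlb_all_local();               /* one flush after the walk */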

> > > +/* get the dirty bitmap from the linked list stored by
> > > +add_mapped_vaddr
> > > + * and also clear the mapped vaddrs linked list */ static void
> > > +get_dirty_bitmap(struct domain *d, uint8_t *bitmap[]) {
> > > +    struct page_info *page_head;
> > > +    struct mapped_va_node *mvn_head;
> > > +    struct page_info *tmp_page;
> > > +    struct mapped_va_node *tmp_mvn;
> > > +    vaddr_t gma_start = 0;
> > > +    vaddr_t gma_end = 0;
> > > +
> > > +    /* this hypercall is called from domain 0, and we don't know which
> > guest's
> > > +     * vlpt is mapped in xen_second, so, to be sure, we restore vlpt
> > here */
> > > +    restore_vlpt(d);
> > 
> > AFAICT you don't actually use the vlpt here, just the domains list of
> > dirty addresses (which as above could become a bitmap)
> 
> I think vlpt is still needed because at this stage, we have to reset the 
> write permission of dirtied pages. Maybe some tricks for efficiently doing this?

Ah, of course.

Do you need to be careful about the context switch of a dom0 vcpu which
has a foreign vlpt loaded? I suppose you unload it at the end of this
function so it is safe because Xen doesn't preempt vcpus in hypervisor
mode.

> > 
> > > +    struct p2m_domain *p2m = &d->arch.p2m;
> > > +    vaddr_t ram_base;
> > > +    int i1, i2, i3;
> > > +    int first_index, second_index, third_index;
> > > +    lpae_t *first = __map_domain_page(p2m->first_level);
> > > +    lpae_t pte, *second = NULL, *third = NULL;
> > > +
> > > +    get_gma_start_end(d, &ram_base, NULL);
> > > +
> > > +    first_index = first_table_offset((uint64_t)ram_base);
> > > +    second_index = second_table_offset((uint64_t)ram_base);
> > > +    third_index = third_table_offset((uint64_t)ram_base);
> > > +
> > > +    BUG_ON( !first && "Can't map first level p2m." );
> > > +
> > > +    spin_lock(&p2m->lock);
> > > +
> > > +    for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 )
> > > +    {
> > > +        lpae_walk_t first_pte = first[i1].walk;
> > > +        if ( !first_pte.valid || !first_pte.table )
> > > +            goto out;
> > > +
> > > +        second = map_domain_page(first_pte.base);
> > > +        BUG_ON( !second && "Can't map second level p2m.");
> > > +        for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 )
> > > +        {
> > > +            lpae_walk_t second_pte = second[i2].walk;
> > > +
> > > +            /* a nasty hack for vlpt due to the difference
> > > +             * of page table entry layout between p2m and pt */
> > > +            second[i2].pt.ro = 0;
> > 
> > What is the hack here?
> > 
> > The issue is that the p2m second level, which is a table entry in the p2m
> > is installed into the Xen page tables where it ends up the third level,
> > treated as a block entry.
> > 
> > That's OK, because for both PT and P2M bits 2..10 are ignored for table
> > entries. So I think rather than this hack here we should simply ensure
> > that our non-leaf p2m's have the correct bits in them to be used as p2m
> > table entries.
> 
> Oh, I see. So, should I put something like this in create_p2m_entries function?

Probably p2m_create_table or maybe mfn_to_p2m entry would be the right
place.
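i.e. when the non-leaf entry is built, something along these lines (a
sketch; the exact place and field names depend on how the entry is
constructed there):

    /* Sketch: give non-leaf p2m entries the attribute bits their Xen PT
     * view wants.  Bits 2..10 are ignored for table entries, so this is
     * harmless for the p2m walk itself. */
    pte.p2m.table = 1;   /* non-leaf entry */
    pte.pt.ro = 0;       /* writable when reused in xen_second for the vlpt */
    write_pte(entry, pte);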

Ian.


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-17 10:30       ` Ian Campbell
@ 2013-10-17 11:05         ` Jaeyong Yoo
  2013-10-17 11:47           ` Ian Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: Jaeyong Yoo @ 2013-10-17 11:05 UTC (permalink / raw)
  To: 'Ian Campbell'; +Cc: 'Stefano Stabellini', xen-devel

> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> bounces@lists.xen.org] On Behalf Of Ian Campbell
> Sent: Thursday, October 17, 2013 7:30 PM
> To: Jaeyong Yoo
> Cc: 'Stefano Stabellini'; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for
> dirty page tracing
> 
> On Thu, 2013-10-17 at 19:12 +0900, Jaeyong Yoo wrote:
> 
> > > > @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d)
> > > >      unmap_domain_page_global(d->arch.dirty.second_lvl);
> > > >  }
> > > >
> > > > +/* routine for dirty-page tracing
> > > > + *
> > > > + * On first write, it page faults, its entry is changed to
> > > > +read-write,
> > > > + * and on retry the write succeeds.
> > > > + *
> > > > + * for locating p2m of the faulting entry, we use virtual-linear
> > > > +page
> > > table.
> > > > + */
> > > > +void handle_page_fault(struct domain *d, paddr_t addr) {
> > > > +    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
> > > > +    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
> > >
> > > I think you need to distinguish p2m entries which are r/o due to log
> > > dirty from those which are r/o due to wanting them to be r/o.
> > >
> > > Maybe we don't have the latter right now? We will eventually need
> > > p2m types I think.
> >
> > Yes, that should be distinguished. Actually, that was in my mind and
> > apparently I failed to remember. For doing this, I'm thinking to use
> > un-used field in pte as an indicator of 'wanting to be r/o' and check
> this indicator in permission fault.
> 
> Yes, I think we can use the 4 avail bits to store a p2m type. ISTR
> discussing this before, but perhaps it wasn't with you!
> 
> >
> > >
> > > > +    {
> > > > +        lpae_t pte = *vlp2m_pte;
> > > > +        pte.p2m.write = 1;
> > > > +        write_pte(vlp2m_pte, pte);
> > > > +        flush_tlb_local();
> > > > +
> > > > +        /* in order to remove mappings in cleanup stage */
> > > > +        add_mapped_vaddr(d, addr);
> > >
> > > No locks or atomic operations here? How are races with the tools
> > > reading the dirty bitmap addressed?  If it is by clever ordering of
> > > the checks and pte writes then I think a comment would be in order.
> >
> > Actually, the lock is inside the add_mapped_vaddr function.
> 
> But that only covers the bitmap update, not the pte frobbing.

I also take the same lock in get_dirty_bitmap, which reads the bitmap.

> 
> >
> > >
> > > > +    }
> > > > +}
> > > > +
> > > >  /*
> > > >   * Local variables:
> > > >   * mode: C
> > > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c index
> > > > 2d09fef..3b2b11a 100644
> > > > --- a/xen/arch/arm/p2m.c
> > > > +++ b/xen/arch/arm/p2m.c
> > > > @@ -6,6 +6,8 @@
> > > >  #include <xen/bitops.h>
> > > >  #include <asm/flushtlb.h>
> > > >  #include <asm/gic.h>
> > > > +#include <xen/guest_access.h>
> > > > +#include <xen/pfn.h>
> > > >
> > > >  void dump_p2m_lookup(struct domain *d, paddr_t addr)  { @@ -408,6
> > > > +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned
> > > > +long
> > > gpfn)
> > > >      return p >> PAGE_SHIFT;
> > > >  }
> > > >
> > > > +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) -
> sizeof(int))\
> > > > +                        / sizeof(unsigned long)
> > > > +
> > > > +/* An array-based linked list for storing virtual addresses
> > > > +(i.e., the 3rd
> > > > + * level table mapping) that should be destroyed after live
> > > > +migration */ struct mapped_va_node {
> > > > +    struct page_info *next;
> > > > +    struct mapped_va_node *mvn_next;
> > > > +    int items;
> > > > +    unsigned long vaddrs[MAX_VA_PER_NODE]; };
> > > > +
> > > > +int add_mapped_vaddr(struct domain *d, unsigned long va) {
> > >
> > > This function seems awefuly expensive (contains allocations etc) for
> > > a function called on every page fault during a migration.
> > >
> > > Why are you remembering the full va of every fault when a single bit
> > > per page would do?
> > >
> > > I think you can allocate a bitmap when logdirty is enabled. At a bit
> > > per page it shouldn't be too huge.
> > >
> > > You might be able to encode this in the p2m which you walk when the
> > > tools ask for the bit map (i.e. p2m.type==writeable p2m.write==0 =>
> > > dirty), but I think a bitmap would do the trick and be easier to
> implement.
> >
> > There are three candidates:
> >
> > 1) bitmap
> > 2) encoding into p2m
> > 3) linked list of VAs.
> >
> > And, two functions for dirty-page tracing:
> >
> > A) dirty-page handling (At trap)
> > B) getting dirty bitmap (for toolstack)
> >
> > 1) and 2) have advantage for A) but have to search the full range of
> > pages at B) to find out which page is dirtied and set the page as
> > read-only for next dirty detection. On the otherhand, 3) does not need
> to search the full range at B).
> > I prefer 3) due to the 'non-full range search' but I understand your
> concern.
> > Maybe I can do some measurement for each method to see which one better.
> 
> If you have a bitmap then you can do a scan for each set bit, and the
> offset of each bit corresponds to the guest pfn, which allows you to
> directly calculate the address of the pte in the vlpt.
> 
> I don't think doing a scan for each set bit is going to compare
> unfavourably to walking a linked list. Certainly not once the memory
> allocator gets involved.

Oh, do you mean using something like "find_next_zero_bit"? At first I
thought of searching bit-by-bit, but I just looked at findbit.S and it
looks quite optimized.

> 
> > > > +/* get the dirty bitmap from the linked list stored by
> > > > +add_mapped_vaddr
> > > > + * and also clear the mapped vaddrs linked list */ static void
> > > > +get_dirty_bitmap(struct domain *d, uint8_t *bitmap[]) {
> > > > +    struct page_info *page_head;
> > > > +    struct mapped_va_node *mvn_head;
> > > > +    struct page_info *tmp_page;
> > > > +    struct mapped_va_node *tmp_mvn;
> > > > +    vaddr_t gma_start = 0;
> > > > +    vaddr_t gma_end = 0;
> > > > +
> > > > +    /* this hypercall is called from domain 0, and we don't know
> > > > + which
> > > guest's
> > > > +     * vlpt is mapped in xen_second, so, to be sure, we restore
> > > > + vlpt
> > > here */
> > > > +    restore_vlpt(d);
> > >
> > > AFAICT you don't actually use the vlpt here, just the domains list
> > > of dirty addresses (which as above could become a bitmap)
> >
> > I think vlpt is still needed because at this stage, we have to reset
> > the write permission of dirtied pages. Maybe some tricks for efficiently
> doing this?
> 
> Ah, of course.
> 
> Do you need to be careful about the context switch of a dom0 vcpu which
> has a foreign vlpt loaded? I suppose you unload it at the end of this
> function so it is safe because Xen doesn't preempt vcpus in hypervisor
> mode.


That is my concern now. For safety, unloading is necessary, but it requires
flushing the vlpt, which is not a cheap operation.

Actually, there is one more concern:
Since we only context switch the vlpt area for 'migrating domains' when two
domains (one migrating and the other not) are context switched, the
non-migrating domain has access to the foreign vlpt. In order to prevent
access to the foreign vlpt area, do you think unloading the vlpt when
switching to non-migrating domains is also necessary?

> 
> > >
> > > > +    struct p2m_domain *p2m = &d->arch.p2m;
> > > > +    vaddr_t ram_base;
> > > > +    int i1, i2, i3;
> > > > +    int first_index, second_index, third_index;
> > > > +    lpae_t *first = __map_domain_page(p2m->first_level);
> > > > +    lpae_t pte, *second = NULL, *third = NULL;
> > > > +
> > > > +    get_gma_start_end(d, &ram_base, NULL);
> > > > +
> > > > +    first_index = first_table_offset((uint64_t)ram_base);
> > > > +    second_index = second_table_offset((uint64_t)ram_base);
> > > > +    third_index = third_table_offset((uint64_t)ram_base);
> > > > +
> > > > +    BUG_ON( !first && "Can't map first level p2m." );
> > > > +
> > > > +    spin_lock(&p2m->lock);
> > > > +
> > > > +    for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 )
> > > > +    {
> > > > +        lpae_walk_t first_pte = first[i1].walk;
> > > > +        if ( !first_pte.valid || !first_pte.table )
> > > > +            goto out;
> > > > +
> > > > +        second = map_domain_page(first_pte.base);
> > > > +        BUG_ON( !second && "Can't map second level p2m.");
> > > > +        for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 )
> > > > +        {
> > > > +            lpae_walk_t second_pte = second[i2].walk;
> > > > +
> > > > +            /* a nasty hack for vlpt due to the difference
> > > > +             * of page table entry layout between p2m and pt */
> > > > +            second[i2].pt.ro = 0;
> > >
> > > What is the hack here?
> > >
> > > The issue is that the p2m second level, which is a table entry in
> > > the p2m is installed into the Xen page tables where it ends up the
> > > third level, treated as a block entry.
> > >
> > > That's OK, because for both PT and P2M bits 2..10 are ignored for
> > > table entries. So I think rather than this hack here we should
> > > simply ensure that our non-leaf p2m's have the correct bits in them
> > > to be used as p2m table entries.
> >
> > Oh, I see. So, should I put something like this in create_p2m_entries
> function?
> 
> Probably p2m_create_table or maybe mfn_to_p2m entry would be the right
> place.

OK.

> 
> Ian.


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-17 11:05         ` Jaeyong Yoo
@ 2013-10-17 11:47           ` Ian Campbell
  2013-10-18  4:17             ` Jaeyong Yoo
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Campbell @ 2013-10-17 11:47 UTC (permalink / raw)
  To: Jaeyong Yoo; +Cc: 'Stefano Stabellini', xen-devel

On Thu, 2013-10-17 at 20:05 +0900, Jaeyong Yoo wrote:
> > >
> > > >
> > > > > +    {
> > > > > +        lpae_t pte = *vlp2m_pte;
> > > > > +        pte.p2m.write = 1;
> > > > > +        write_pte(vlp2m_pte, pte);
> > > > > +        flush_tlb_local();
> > > > > +
> > > > > +        /* in order to remove mappings in cleanup stage */
> > > > > +        add_mapped_vaddr(d, addr);
> > > >
> > > > No locks or atomic operations here? How are races with the tools
> > > > reading the dirty bitmap addressed?  If it is by clever ordering of
> > > > the checks and pte writes then I think a comment would be in order.
> > >
> > > Actually, the lock is inside the add_mapped_vaddr function.
> > 
> > But that only covers the bitmap update, not the pte frobbing.
> 
> I also locked with the same lock at the get_dirty_bitmap which is reading
> the bitmap.

I meant the pte frobbing immediately before the call the
add_mapped_vaddr quoted above.

> > > There are three candidates:
> > >
> > > 1) bitmap
> > > 2) encoding into p2m
> > > 3) linked list of VAs.
> > >
> > > And, two functions for dirty-page tracing:
> > >
> > > A) dirty-page handling (At trap)
> > > B) getting dirty bitmap (for toolstack)
> > >
> > > 1) and 2) have advantage for A) but have to search the full range of
> > > pages at B) to find out which page is dirtied and set the page as
> > > read-only for next dirty detection. On the otherhand, 3) does not need
> > to search the full range at B).
> > > I prefer 3) due to the 'non-full range search' but I understand your
> > concern.
> > > Maybe I can do some measurement for each method to see which one better.
> > 
> > If you have a bitmap then you can do a scan for each set bit, and the
> > offset of each bit corresponds to the guest pfn, which allows you to
> > directly calculate the address of the pte in the vlpt.
> > 
> > I don't think doing a scan for each set bit is going to compare
> > unfavourably to walking a linked list. Certainly not once the memory
> > allocator gets involved.
> 
> Oh, do you mean using something like "find_next_zero_bit"?

find_next_bit, probably, since you want to find the 1s. I suppose you
could also initialise to all 1s and clear bits instead.

>  At the first
> time,
> I thought searching every bit-by-bit but, I just see findbit.S and it looks
> quite optimized. 

Yes, I think it should be fast enough.
 
> > > > > +/* get the dirty bitmap from the linked list stored by
> > > > > +add_mapped_vaddr
> > > > > + * and also clear the mapped vaddrs linked list */ static void
> > > > > +get_dirty_bitmap(struct domain *d, uint8_t *bitmap[]) {
> > > > > +    struct page_info *page_head;
> > > > > +    struct mapped_va_node *mvn_head;
> > > > > +    struct page_info *tmp_page;
> > > > > +    struct mapped_va_node *tmp_mvn;
> > > > > +    vaddr_t gma_start = 0;
> > > > > +    vaddr_t gma_end = 0;
> > > > > +
> > > > > +    /* this hypercall is called from domain 0, and we don't know
> > > > > + which
> > > > guest's
> > > > > +     * vlpt is mapped in xen_second, so, to be sure, we restore
> > > > > + vlpt
> > > > here */
> > > > > +    restore_vlpt(d);
> > > >
> > > > AFAICT you don't actually use the vlpt here, just the domains list
> > > > of dirty addresses (which as above could become a bitmap)
> > >
> > > I think vlpt is still needed because at this stage, we have to reset
> > > the write permission of dirtied pages. Maybe some tricks for efficiently
> > doing this?
> > 
> > Ah, of course.
> > 
> > Do you need to be careful about the context switch of a dom0 vcpu which
> > has a foreign vlpt loaded? I suppose you unload it at the end of this
> > function so it is safe because Xen doesn't preempt vcpus in hypervisor
> > mode.
> 
> 
> That is my concern now. For safety, unloading is necessary but it requires
> flushing vlpt which is not a cheap operation.

I suppose it could be deferred either to the next load of a vlpt, if it
is a different one, or to the next context switch away from the vcpu. You'd
just need to track which foreign vlpt was currently loaded on the vcpu
in vcpu->arch.something?
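Something like this, perhaps (a sketch only; loaded_vlpt is a made-up field
and the choice of TLB flush primitive would need checking):

    /* Sketch: flush/reload the vlpt lazily, only when the foreign
     * domain being traced actually changes. */
    static void vlpt_load_lazy(struct vcpu *v, struct domain *target)
    {
        if ( v->arch.loaded_vlpt == target )
            return;                      /* already mapped */

        if ( v->arch.loaded_vlpt )
            flush_xen_data_tlb();        /* drop the stale vlpt mappings */

        restore_vlpt(target);
        v->arch.loaded_vlpt = target;
    }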

> Actually, there is one more concern:
> Since we only context switch the vlpt area with 'migrating domains' when two
> domains (one migrating and the other not) are context switched, the
> non-migrating
> domain have access to the foreign vlpt. In order to prevent the access of
> foreign vlpt area, do you think unloading of vlpt when switching into
> non-migrating
> domains also necessary?

The domain itself doesn't have access to the vlpt since it is mapped in
hypervisor context so the only danger is that Xen accidentally uses the
vlpt for something while the non-migrating domain is scheduled. I think
we should just not write that bug ;-)

IOW I think flushing the vlpt lazily is OK.

Ian.


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-17 11:47           ` Ian Campbell
@ 2013-10-18  4:17             ` Jaeyong Yoo
  2013-10-18  9:11               ` Ian Campbell
  0 siblings, 1 reply; 18+ messages in thread
From: Jaeyong Yoo @ 2013-10-18  4:17 UTC (permalink / raw)
  To: 'Ian Campbell'; +Cc: 'Stefano Stabellini', xen-devel

> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> bounces@lists.xen.org] On Behalf Of Ian Campbell
> Sent: Thursday, October 17, 2013 8:47 PM
> To: Jaeyong Yoo
> Cc: 'Stefano Stabellini'; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for
> dirty page tracing
> 
> On Thu, 2013-10-17 at 20:05 +0900, Jaeyong Yoo wrote:
> > > >
> > > > >
> > > > > > +    {
> > > > > > +        lpae_t pte = *vlp2m_pte;
> > > > > > +        pte.p2m.write = 1;
> > > > > > +        write_pte(vlp2m_pte, pte);
> > > > > > +        flush_tlb_local();
> > > > > > +
> > > > > > +        /* in order to remove mappings in cleanup stage */
> > > > > > +        add_mapped_vaddr(d, addr);
> > > > >
> > > > > No locks or atomic operations here? How are races with the tools
> > > > > reading the dirty bitmap addressed?  If it is by clever ordering
> > > > > of the checks and pte writes then I think a comment would be in
> order.
> > > >
> > > > Actually, the lock is inside the add_mapped_vaddr function.
> > >
> > > But that only covers the bitmap update, not the pte frobbing.
> >
> > I also locked with the same lock at the get_dirty_bitmap which is
> > reading the bitmap.
> 
> I meant the pte frobbing immediately before the call the add_mapped_vaddr
> quoted above.
> 

I don't think it is necessary, because at the moment the dirty bitmap is
constructed based on the VAs in the linked list. If the PTE frobbing
happens immediately before add_mapped_vaddr, the corresponding dirty page
is simply picked up at the next round of get-dirty-bitmap.

Now I am starting to think of using a bitmap rather than a linked list, and
I think the situation is similar there: the critical sections would be the
one that writes into the bitmap and the one that reads it (the copy out to
the toolstack).
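So the comment in handle_page_fault could say something like this (sketch):

    /*
     * Ordering note: the PTE is made writable (and the TLB flushed)
     * *before* the fault is recorded.  If a CLEAN/PEEK races in
     * between, it simply does not see this page yet; the record
     * completes just afterwards and is reported (and the page reset
     * to r/o) on the next round, so no dirty page is lost.  The
     * list/bitmap itself is protected by d->arch.dirty.lock on both
     * the writer and the reader side.
     */
    write_pte(vlp2m_pte, pte);
    flush_tlb_local();
    add_mapped_vaddr(d, addr);   /* takes d->arch.dirty.lock */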

Jaeyong


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
  2013-10-18  4:17             ` Jaeyong Yoo
@ 2013-10-18  9:11               ` Ian Campbell
  0 siblings, 0 replies; 18+ messages in thread
From: Ian Campbell @ 2013-10-18  9:11 UTC (permalink / raw)
  To: Jaeyong Yoo; +Cc: 'Stefano Stabellini', xen-devel

On Fri, 2013-10-18 at 13:17 +0900, Jaeyong Yoo wrote:

> > I meant the pte frobbing immediately before the call the add_mapped_vaddr
> > quoted above.
> > 
> 
> I don't think it is necessary because at the moment the dirty bitmap
> is constructed based on the VAs in the linked list. If the PTE frobbing
> happens
> immediately before add_mapped_vaddr, the corresponding dirty-page would be 
> checked at the next round get-dirty-gitmap. 

OK, thanks. Can we get a code comment on the ordering constraints here
please.

>  
> 
> Now, I start thinking of using bitmap rather than linked list, and I think
> it
> is similar there. The critical section would be the one that writes into the
> bitmap and the one that reads the bitmap (copy to the bitmap from
> toolstack).
> 
> Jaeyong
> 


* Re: [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
@ 2013-10-20  4:22 Jaeyong Yoo
  0 siblings, 0 replies; 18+ messages in thread
From: Jaeyong Yoo @ 2013-10-20  4:22 UTC (permalink / raw)
  To: Ian Campbell; +Cc: 'Stefano Stabellini', xen-devel@lists.xen.org

> > > I meant the pte frobbing immediately before the call the add_mapped_vaddr
> > > quoted above.
> > > 
> > 
> > I don't think it is necessary because at the moment the dirty bitmap
> > is constructed based on the VAs in the linked list. If the PTE frobbing
> > happens
> > immediately before add_mapped_vaddr, the corresponding dirty-page would be 
> > checked at the next round get-dirty-gitmap. 
>
> OK, thanks. Can we get a code comment on the ordering constraints here
> please.

Sure, of course.

Jaeyong.


end of thread, other threads:[~2013-10-20  4:22 UTC | newest]

Thread overview: 18+ messages
2013-10-08  6:29 [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing Jaeyong Yoo
2013-10-08  8:46 ` Ian Campbell
2013-10-08  9:46   ` Jaeyong Yoo
2013-10-08 15:36     ` Eugene Fedotov
2013-10-10 14:13       ` Julien Grall
2013-10-16 13:44         ` Ian Campbell
  -- strict thread matches above, loose matches on Subject: below --
2013-10-20  4:22 Jaeyong Yoo
2013-10-04  4:43 [PATCH v4 0/9] xen/arm: live migration support in arndale board Jaeyong Yoo
2013-10-04  4:44 ` [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing Jaeyong Yoo
2013-10-07 13:02   ` Julien Grall
2013-10-07 15:51     ` Eugene Fedotov
2013-10-10 14:10       ` Julien Grall
2013-10-16 13:44   ` Ian Campbell
2013-10-17 10:12     ` Jaeyong Yoo
2013-10-17 10:30       ` Ian Campbell
2013-10-17 11:05         ` Jaeyong Yoo
2013-10-17 11:47           ` Ian Campbell
2013-10-18  4:17             ` Jaeyong Yoo
2013-10-18  9:11               ` Ian Campbell
